Mixtures of g-Priors in Generalized Linear Models


Yingbo Li and Merlise A. Clyde


Abstract 

Mixtures of Zellner's g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to Generalized Linear Models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and resulting properties for inference have received considerably less attention. In this paper, we unify mixtures of g-priors in GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1 + g), which encompasses as special cases several mixtures of g-priors in the literature, such as the hyper-g, Beta-prime, truncated Gamma, incomplete inverse-Gamma, benchmark, robust, hyper-g/n, and intrinsic priors. Through an integrated Laplace approximation, the posterior distribution of 1/(1 + g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically, leading to "Compound Hypergeometric Information Criteria" for model selection. We discuss the local geometric properties of the g-prior in GLMs and show how the desiderata for model selection proposed by Bayarri et al. (2012), such as asymptotic model selection consistency, intrinsic consistency, and measurement invariance, may be used to justify the prior and specific choices of the hyperparameters.

We illustrate inference using these priors and contrast them to other approaches via simulation and real data examples. The methodology is implemented in the R package BAS and freely available on CRAN.

Keywords: Bayesian model selection, Bayesian model averaging, variable selection, linear regression, hyper-g priors


1 Introduction 


Careful subjective elicitation of prior distributions for variable selection, although ideal, quickly becomes intractable as the number of variables increases, motivating the need for objective prior distributions that are automatic and have good frequentist properties for default usage (Berger and Pericchi 2001). In the context of Bayesian variable selection for linear models, Zellner's g-prior and, in particular, mixtures of g-priors have witnessed widespread use due to computational tractability, consistency, invariance, and other desiderata (Liang et al. 2008; Bayarri et al. 2012; Ley and Steel 2012) that lead to the preference of these priors over many other conventional prior distributions.


Zellner (1983, 1986) proposed the g-prior as a simple partially informative reference distribution in Gaussian regression models $Y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$, where formulation of informative prior distributions for regression coefficients $\beta$ has been and remains a challenging problem. Through the use of imaginary samples taken at the same observed design matrix $X$, he obtained a conjugate Gaussian prior distribution $\beta \sim N(b_0, g\sigma^2 (X^T X)^{-1})$, with an informative mean $b_0$, but a covariance matrix that was a scaled version of the covariance matrix of the maximum likelihood estimator, $\sigma^2 (X^T X)^{-1}$.¹ This greatly simplified elicitation to two quantities: the prior mean $b_0$ of the regression coefficients, for which practitioners often had prior beliefs, and the scalar $g$, which controlled both the shrinkage towards the prior mean and the dispersion of the posterior covariance through the shrinkage factor $g/(1 + g)$.

¹ We follow the now standard notation; however, in Zellner's papers the prior covariance appears as $(\sigma^2/g)(X^T X)^{-1}$.

In Bayesian variable selection (BVS) and Bayesian model averaging (BMA) problems for Gaussian regression models with $p$ predictors, every subset model, indexed by $\mathcal{M} \in \{0,1\}^p$, may be expressed as $Y = 1_n \alpha + X_{\mathcal{M}} \beta_{\mathcal{M}} + \epsilon$, where $1_n$ is a column vector of ones of length $n$, $\alpha$ is the intercept, $X_{\mathcal{M}}$ is the model-specific design matrix with $p_{\mathcal{M}}$ columns of full rank, and $\beta_{\mathcal{M}}$ is the vector of length $p_{\mathcal{M}}$ of the non-zero regression coefficients in model $\mathcal{M}$. The most common formulation of Zellner's g-prior, as in Liang et al. (2008), uses the independent Jeffreys prior for $\alpha$ and $\sigma^2$,

    $$p(\alpha) \propto 1,$$    (1)
    $$p(\sigma^2) \propto 1/\sigma^2,$$    (2)

and a g-prior of the form

    $$\beta_{\mathcal{M}} \mid \sigma^2, g, \mathcal{M} \sim N\left(0, \; g\sigma^2 \left[X_{\mathcal{M}}^T (I_n - P_{1_n}) X_{\mathcal{M}}\right]^{-1}\right),$$    (3)

where $P_{1_n} = 1_n (1_n^T 1_n)^{-1} 1_n^T$ is the orthogonal projection on the space spanned by the column vector $1_n$. While it is often assumed that the columns of the design matrix $X_{\mathcal{M}}$ must be centered so that the Fisher information matrix is block diagonal (due to $1_n^T X_{\mathcal{M}} = 0_{p_{\mathcal{M}}}$) to justify the use of the improper reference priors on the common intercept and variance, Bayarri et al. (2012) argue that measurement invariance, which leads to (1) and (2), combined with predictive matching, for which Bayes factors under minimal sample sizes do not favor $\mathcal{M}$ or $\mathcal{M}_0$, leads to the form of the g-prior above, providing an alternative justification for centering the design matrix.

It is well known that the choice of g affects shrinkage in estimation and prediction, BVS, and BMA, with various approaches being put forward to determine a g with desirable properties. Independent of Zellner, Copas (1983, 1997) arrived at g-priors in linear and logistic regression by considering shrinkage of maximum likelihood estimators (MLEs) to improve prediction and estimation, as in James-Stein estimators, proposing empirical Bayes estimates of the shrinkage factor to improve frequentist properties of the estimators. Related to Copas, Foster and George (1994) considered risk and expected loss in selecting g; George and Foster (2000) derived global empirical Bayes estimators, while Hansen and Yu (2003) derived model-specific local empirical Bayes estimates of g from an information theory perspective. Fernández et al. (2001) studied consistency of BMA under g-priors in linear models, recommending $g = \max(p^2, n)$, which leads to Bayes factors that behave like BIC when $g = n$ or the Risk Inflation Criterion (Foster and George 1994) when $g = p^2$.

Mixtures of g-priors, obtained by specifying a prior distribution on the hyperparameter g in (3), including the hyper-g and related hyper-g/n priors (Liang et al. 2008; Cui and George 2008), the Beta-prime prior (Maruyama and George 2011), the robust prior (Bayarri et al. 2012), and the intrinsic prior (Casella and Moreno 2006; Womack et al. 2014), among others, are widely used in model selection and model averaging problems, due to their attractive theoretical properties in contrast to g-priors with fixed g (Liang et al. 2008; Feldkircher and Zeugner 2009; Maruyama and George 2011; Celeux et al. 2012; Ley and Steel 2012; Feldkircher 2012; Fouskakis and Ntzoufras 2013). Mixtures of g-priors not only inherit the desirable measurement invariance property from the g-prior but, under a range of hyperparameters, also resolve the information paradox (Liang et al. 2008) and Bartlett's paradox (Bartlett 1957; Lindley 1968) that occur with a fixed g, while leading to asymptotic consistency for model selection and estimation (Liang et al. 2008; Maruyama and George 2011; Bayarri et al. 2012). Furthermore, by yielding exact or analytic expressions for marginal likelihoods in tractable forms, these mixtures of g-priors enjoy most of the computational efficiency of the original g-prior, permitting efficient computational algorithms for stochastic search of the posterior distribution over the model space (Clyde et al. 2011).

For generalized linear models (GLMs), many variants of g-priors have been proposed in the literature, including Copas (1983, 1997); Kass and Wasserman (1995); Hansen and Yu (2003); Rathbun and Fei (2006); Marin and Robert (2007); Wang and George (2007); Fouskakis et al. (2009); Gupta and Ibrahim (2009); Sabanés Bové and Held (2011); Hanson et al. (2014); Perrakis et al. (2015); Held et al. (2015); Fouskakis et al. (2016), with current methods favoring adaptive estimates of g via mixtures of g-priors or empirical Bayes estimates of g. While these priors have a number of desirable properties, no consensus on an objective prior has emerged for GLMs. The seminal paper of Bayarri et al. (2012) takes an alternative approach and explores whether a consensus of criteria or desiderata that any objective prior should satisfy can instead be used to identify an objective prior, leading to their recommendation of the "robust" prior in Gaussian variable selection problems. In this article, we view g-priors in GLMs through this lens, seeing if the desiderata can essentially determine an objective prior in GLMs for practical use.

The remainder of the article is arranged as follows. In Section 2, we begin by reviewing g-priors in GLMs and corresponding (approximate) Bayes factors, and the closely related Bayes factors based on test statistics (Johnson 2005, 2008; Hu and Johnson 2009; Held et al. 2015). As tractable expressions are generally unavailable in GLMs, we focus attention on using an integrated Laplace approximation and show that g-priors based on observed information lead to distributions that are closed under sampling (conditionally conjugate). To unify results with linear models and g-priors in GLMs, in Section 3 we introduce the truncated Compound Confluent Hypergeometric distribution (Gordy 1998b), a flexible generalized Beta distribution, which encompasses current mixtures of g-priors as special cases. This leads to a new family of "Compound Hypergeometric Information Criteria" or CHIC. In Section 4, we review the desiderata for model selection priors of Bayarri et al. (2012) and use them to establish theoretical properties of the CHIC family, which provides general recommendations for hyperparameters. In Section 5, we study the BVS and BMA performance of the CHIC g-prior with various hyperparameters, using simulation studies and the GUSTO-I data (Steyerberg 2009; Held et al. 2015). Finally, in Section 6 we summarize recommendations and discuss directions for future research.


2 g-Priors in Generalized Linear Models


To begin we define notation and assumptions for the generalized linear models (GLMs) under consideration. GLMs arise from distributions within the exponential family (McCullagh and Nelder 1989), with density

    $$p(Y_i) = \exp\left\{ \frac{Y_i \theta_i - b(\theta_i)}{a(\phi_0)} + c(Y_i, \phi_0) \right\}, \quad i = 1, \ldots, n,$$    (4)

where $a(\cdot)$, $b(\cdot)$ and $c(\cdot, \cdot)$ are specific functions that determine the distribution. The mean and variance for each observation $Y_i$ can be written as $E(Y_i) = b'(\theta_i)$ and $V(Y_i) = a(\phi_0) b''(\theta_i)$, respectively, where $b'(\cdot)$ and $b''(\cdot)$ are the first and second derivatives of $b(\cdot)$. In (4), $Y_1, \ldots, Y_n$ are independent but not identically distributed, as their corresponding canonical parameters $\theta_1, \ldots, \theta_n$ are linked with the predictors via $\theta_i = \theta(\eta_{\mathcal{M},i})$, where $\eta_{\mathcal{M},i}$ is the $i$th entry of the linear predictor

    $$\eta_{\mathcal{M}} = 1_n \alpha + X_{\mathcal{M}} \beta_{\mathcal{M}}$$    (5)

under model $\mathcal{M}$, providing the "linear model". Under this parameterization the canonical link corresponds to the identity function for $\theta(\cdot)$.
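As a small numerical illustration of the mean and variance identities for (4), the following Python sketch differentiates the cumulant function $b(\theta)$ by finite differences for the Poisson and Bernoulli families with $a(\phi_0) = 1$. The function names and the value of `theta` are hypothetical choices made for illustration; they are not part of the paper or of the BAS package.

```python
import math

def num_deriv(f, x, h=1e-4):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def num_deriv2(f, x, h=1e-4):
    """Central-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# Cumulant functions b(theta) for two families, both with a(phi_0) = 1.
def b_poisson(theta):
    return math.exp(theta)

def b_bernoulli(theta):
    return math.log1p(math.exp(theta))

def mean_and_variance(b, theta, a_phi=1.0):
    """Return (E(Y), V(Y)) = (b'(theta), a(phi_0) * b''(theta))."""
    return num_deriv(b, theta), a_phi * num_deriv2(b, theta)

theta = 0.3  # illustrative canonical parameter
pois_mean, pois_var = mean_and_variance(b_poisson, theta)
bern_mean, bern_var = mean_and_variance(b_bernoulli, theta)
```

For the Poisson family both quantities equal $e^{\theta}$, and for the Bernoulli family they equal $p$ and $p(1 - p)$ with $p = 1/(1 + e^{-\theta})$, which is what the finite differences recover up to numerical error.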

To begin, we will assume that the scale parameter $a(\phi_0) = \phi_0/w$ with fixed $\phi_0$ and $w$, a known weight that may vary with the observation. This includes popular GLMs such as binary and Binomial regression, Poisson regression, and heteroscedastic normal linear models with known variances. In a later section we will relax the assumption of known $\phi_0$ to illustrate the connections between the prior distributions developed here and existing mixtures of g-priors in normal linear models with unknown precision $\phi_0$, and extend results to consider GLMs with over-dispersion.

Unless specified otherwise, we assume that the design matrix $X$ under the full model has full column rank $p$ and the column space $C(X)$ does not contain $1_n$. Furthermore, we assume that the true model, $\mathcal{M}_T$, is included in the $2^p$ models under consideration. Under $\mathcal{M}_T$, true values of the intercept and regression coefficients are denoted by $\alpha^*_{\mathcal{M}_T}$ and $\beta^*_{\mathcal{M}_T}$. For a model $\mathcal{M}$, if $X_{\mathcal{M}}$ contains all columns of $X_{\mathcal{M}_T}$ (including the case that $\mathcal{M} = \mathcal{M}_T$), we say $\mathcal{M} \supseteq \mathcal{M}_T$; otherwise, $\mathcal{M} \nsupseteq \mathcal{M}_T$. The MLEs are assumed to exist and to be unique. Under standard regularity conditions provided in the supplementary materials Appendix A.1, MLEs are consistent and asymptotically normal. In Section 2.5 we will relax the conditions to consider non-full rank design matrices and data separation problems in binary regressions.

In Bayesian variable selection or Bayesian model averaging, posterior probabilities of models are critical components for posterior inference, which in the context of g-priors may be expressed as

    $$p(\mathcal{M} \mid Y, g) = \frac{p(Y \mid \mathcal{M}, g) \, \pi(\mathcal{M})}{\sum_{\mathcal{M}'} p(Y \mid \mathcal{M}', g) \, \pi(\mathcal{M}')},$$

where $\pi(\mathcal{M})$ is the prior probability of model $\mathcal{M}$, and

    $$p(Y \mid \mathcal{M}, g) = \int\!\!\int p(Y \mid \alpha, \beta_{\mathcal{M}}, \mathcal{M}) \, p(\alpha) \, p(\beta_{\mathcal{M}} \mid \mathcal{M}, g) \, d\alpha \, d\beta_{\mathcal{M}}$$    (6)

is the marginal likelihood of model $\mathcal{M}$. In normal linear regression, g-priors yield closed form marginal likelihoods, which permit quick posterior probability computation and efficient model search by avoiding the time-consuming procedure of sampling $\alpha$ and $\beta_{\mathcal{M}}$. When the likelihood is non-Gaussian, normal priors are no longer conjugate; however, Laplace approximations to the likelihood (Tierney and Kadane 1986; Tierney et al. 1989) combined with normal priors such as g-priors may be used to achieve computational efficiency, as in Integrated Nested Laplace approximations (Rue et al. 2009; Held et al. 2015).
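Once (approximate) log marginal likelihoods are available, posterior model probabilities follow by normalizing $p(Y \mid \mathcal{M}, g)\,\pi(\mathcal{M})$ over the models considered. The sketch below does this on the log scale for numerical stability; the function name and the four log marginal likelihood values are made up for illustration and do not come from the paper.

```python
import math

def posterior_model_probs(log_marglik, prior=None):
    """Normalize p(Y | M, g) * pi(M) over models using the log-sum-exp trick."""
    k = len(log_marglik)
    if prior is None:
        prior = [1.0 / k] * k  # uniform prior over the listed models
    log_post = [lm + math.log(pr) for lm, pr in zip(log_marglik, prior)]
    m = max(log_post)  # subtract the maximum before exponentiating
    total = sum(math.exp(lp - m) for lp in log_post)
    return [math.exp(lp - m) / total for lp in log_post]

# Hypothetical log marginal likelihoods for four candidate models.
probs = posterior_model_probs([-1050.2, -1047.9, -1048.6, -1052.0])
```

Subtracting the maximum avoids underflow when log marginal likelihoods are large in magnitude, which is typical for even moderate sample sizes.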

2.1 g-Priors in Generalized Linear Models


There have been several variants of g-priors suggested for GLMs, starting with Copas (1983), who proposed a normal prior centered at zero, with a covariance based on a scaled version of the inverse expected Fisher information evaluated at the MLE of $\alpha$ and $\beta = 0$. Under a large sample normal approximation for the distributions of the MLEs, this leads to conjugate updating and closed form expressions for Bayes factors. Unlike Gaussian models, however, both the observed information $\mathcal{J}_n(\beta_{\mathcal{M}})$, which is the negative Hessian matrix of the log likelihood, and the expected Fisher information $\mathcal{I}_n(\beta_{\mathcal{M}}) = E[\mathcal{J}_n(\beta_{\mathcal{M}})]$ depend on the parameters $\alpha$ and $\beta$, leading to alternative g-priors based on whether the expected information (Kass and Wasserman 1995; Hansen and Yu 2003; Marin and Robert 2007; Fouskakis et al. 2009; Gupta and Ibrahim 2009; Sabanés Bové and Held 2011; Hanson et al. 2014) or observed information (Wang and George 2007) is adopted; they are equal under canonical links when evaluated at the same values. As these information matrices depend on $\beta_{\mathcal{M}}$, the asymptotic covariance is typically evaluated at either $\beta_{\mathcal{M}} = 0$ or at the model-specific MLE. For expected information, $\mathcal{I}_n(\beta_{\mathcal{M}}) = X_{\mathcal{M}}^T \mathcal{I}_n(\eta_{\mathcal{M}}) X_{\mathcal{M}}$, with $\mathcal{I}_n(\eta_{\mathcal{M}})$ a diagonal matrix whose $i$th diagonal entry under model $\mathcal{M}$ is $\mathcal{I}(\eta_{\mathcal{M},i}) = -E[\partial^2 \log p(Y_i \mid \eta_i, \mathcal{M}) / \partial \eta_i^2]$, for $i = 1, \ldots, n$. When $\beta_{\mathcal{M}} = 0$, all $\eta_i = \alpha$ under all models, and $\mathcal{I}_n(\eta_{\mathcal{M}})$ is equal to $I_n/c$, where $1/c = \mathcal{I}(\eta) = -E[\partial^2 \log p(Y \mid \eta, \mathcal{M}_0) / \partial \eta^2]$ is the unit information under the null model. The resulting g-priors have precision matrices that are multiples of $X_{\mathcal{M}}^T X_{\mathcal{M}}$, as in the Gaussian case.


Similar in spirit to Zellner's derivation of the g-prior, priors based on imaginary data have been developed in the context of GLMs by Bedrick et al. (1996); Chen and Ibrahim (2003); Sabanés Bové and Held (2011); Perrakis et al. (2015); Fouskakis et al. (2016), among others. In general, these do not lead to normal prior distributions and typically require MCMC methods to sample both parameters and models for BVS and BMA. The g-prior introduced by Sabanés Bové and Held (2011) and later modified by Held et al. (2015) adopts a large sample approximation to justify a normal density:




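    $$\beta_{\mathcal{M}} \mid g, \mathcal{M} \sim N\left(0, \; g c \left[X_{\mathcal{M}}^T (I_n - P_{1_n}) X_{\mathcal{M}}\right]^{-1}\right),$$    (7)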


where imaginary samples are generated from the null model $\mathcal{M}_0$ and the constant $c$ is the inverse of the unit information given above, evaluated at the MLE of $\alpha$ under $\mathcal{M}_0$. For normal linear regression, $c = \sigma^2$ recovers the usual g-prior.

Under large sample approximations to the likelihood, the g-prior in (7) permits conjugate updating; however, unlike the Gaussian case, evaluating the resulting Bayes factors, which contain ratios of information matrix determinants among other terms, can increase computational complexity and thus negates some of the advantages that made the g-prior so popular in linear models. Classic asymptotic theory suggests that $\mathcal{I}_n(\beta_{\mathcal{M}})$ measures the large sample precision of $\beta_{\mathcal{M}}$, while $\mathcal{J}_n(\hat\beta_{\mathcal{M}})$ is recommended as a more accurate measurement of the same quantity (Efron and Hinkley 1978). When the true model $\mathcal{M}_T \neq \mathcal{M}_0$, evaluating information matrices at the MLE (Hansen and Yu 2003; Wang and George 2007) may better capture the large sample covariance structure of $\beta_{\mathcal{M}}$. This suggests that for GLMs, priors "centered" at the null model may lead to g-priors that do not adequately capture the geometry under model $\mathcal{M}$, potentially leading to prior-likelihood conflict and slower rates of convergence. On the other hand, using large sample approximations to imaginary data generated from $\mathcal{M}$ leads to a prior distribution for $\beta_{\mathcal{M}}$ that is not centered at zero, and therefore will not satisfy the predictive matching criterion of Bayarri et al. (2012).


Next, we propose a g-prior that incorporates the local geometry at the MLE, with the objective of providing a prior that satisfies the model selection desiderata and permits analytic expressions that lead both to computationally efficient algorithms under large sample approximations to likelihoods and to a deeper understanding of their theoretical properties.

2.2 Local Information Metric g-Prior


The invariance and predictive matching criteria in Bayarri et al. (2012) lead to adoption of (1) for location families. Although the Poisson and Bernoulli families are not location families, it is desirable that the prior/posterior distribution for $\beta_{\mathcal{M}}$ is invariant under any location changes in the design matrix $X_{\mathcal{M}}$. In the following proposition, we will use the uniform prior in (1) as a starting point for deriving the (approximate) integrated likelihood for $\beta_{\mathcal{M}}$ and the subsequent prior distribution for $\beta_{\mathcal{M}}$.

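Proposition 1. For any model $\mathcal{M}$, with a uniform prior $p(\alpha) \propto 1$, the marginal likelihood of $\beta_{\mathcal{M}}$ under model $\mathcal{M}$ is proportional to

    $$p(Y \mid \beta_{\mathcal{M}}, \mathcal{M}) = \int p(Y \mid \alpha, \beta_{\mathcal{M}}, \mathcal{M}) \, p(\alpha) \, d\alpha$$
    $$\propto p(Y \mid \hat\alpha_{\mathcal{M}}, \hat\beta_{\mathcal{M}}, \mathcal{M}) \, \mathcal{J}_n(\hat\alpha_{\mathcal{M}})^{-1/2} \exp\left\{ -\frac{1}{2} (\beta_{\mathcal{M}} - \hat\beta_{\mathcal{M}})^T \mathcal{J}_n(\hat\beta_{\mathcal{M}}) (\beta_{\mathcal{M}} - \hat\beta_{\mathcal{M}}) \right\},$$    (8)

where the observed information of $\eta_{\mathcal{M}}$, $\alpha$, and $\beta_{\mathcal{M}}$ at the MLEs, with $\hat\eta_{\mathcal{M},i} = \hat\alpha_{\mathcal{M}} + x_{\mathcal{M},i}^T \hat\beta_{\mathcal{M}}$, are

    $$\mathcal{J}_n(\hat\eta_{\mathcal{M}}) = \mathrm{diag}(d_i), \quad d_i = -Y_i \, \theta''(\hat\eta_{\mathcal{M},i}) + (b \circ \theta)''(\hat\eta_{\mathcal{M},i}), \quad i = 1, \ldots, n,$$    (9)
    $$\mathcal{J}_n(\hat\alpha_{\mathcal{M}}) = 1_n^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) 1_n,$$    (10)
    $$\mathcal{J}_n(\hat\beta_{\mathcal{M}}) = X_{\mathcal{M}}^T (I_n - P_{1_n})^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) (I_n - P_{1_n}) X_{\mathcal{M}},$$    (11)

respectively, and $P_{1_n} = 1_n \left(1_n^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) 1_n\right)^{-1} 1_n^T \mathcal{J}_n(\hat\eta_{\mathcal{M}})$ is the perpendicular projection onto $1_n$ under the information $\mathcal{J}_n(\hat\eta_{\mathcal{M}})$ inner product, where $\langle u, v \rangle_J = u^T J v$ for $u, v \in \mathbb{R}^n$ and a positive definite $J$.

The proof of Proposition 1 is given in the supplementary material Appendix A.2.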


The approximate marginal likelihood in (8) is proportional to a normal kernel of $\beta_{\mathcal{M}}$ with a precision (inverse covariance matrix) that is equal to the marginal observed information $\mathcal{J}_n(\hat\beta_{\mathcal{M}})$, a function of the "centered" predictors,

    $$X^c_{\mathcal{M}} = (I_n - P_{1_n}) X_{\mathcal{M}} = \left[ I_n - 1_n \left(1_n^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) 1_n\right)^{-1} 1_n^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) \right] X_{\mathcal{M}},$$    (12)

where the column means for centering are weighted averages $\bar{x}_j = \sum_i d_i x_{ij} / \sum_i d_i$, with the weights proportional to the $d_i$ in (9). For non-Gaussian GLMs, the $d_i$ are not equal, and hence this centering step is different from the conventional procedure that uses the column-wise arithmetic average.
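The weighted centering can be illustrated with a short Python sketch for logistic regression, where the canonical link gives $\theta'' = 0$ and hence $d_i = b''(\hat\eta_i) = \hat p_i (1 - \hat p_i)$. The linear predictors and design matrix below are hypothetical stand-ins for fitted values, chosen only to show that the centered columns are orthogonal to $1_n$ under the information inner product.

```python
import math

def logistic_weights(eta):
    """d_i = p_i * (1 - p_i) for logistic regression with the canonical link."""
    out = []
    for e in eta:
        p = 1.0 / (1.0 + math.exp(-e))
        out.append(p * (1.0 - p))
    return out

def center_columns(X, d):
    """Subtract the d-weighted mean from each column of X."""
    total = sum(d)
    n_cols = len(X[0])
    means = [sum(di * row[j] for di, row in zip(d, X)) / total
             for j in range(n_cols)]
    Xc = [[row[j] - means[j] for j in range(n_cols)] for row in X]
    return Xc, means

# Hypothetical fitted linear predictors and a 5 x 2 design matrix.
eta_hat = [-1.2, -0.4, 0.1, 0.9, 2.0]
X = [[0.5, 1.0], [1.5, 0.0], [2.0, 1.0], [3.5, 0.0], [4.0, 1.0]]

d = logistic_weights(eta_hat)
Xc, wmeans = center_columns(X, d)

# Inner product <1_n, x_j^c>_J = sum_i d_i * x_ij^c, one value per column.
inner = [sum(di * row[j] for di, row in zip(d, Xc)) for j in range(2)]
```

The entries of `inner` vanish up to rounding error, while `wmeans` differs from the ordinary column means because the weights $d_i$ are unequal.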

This leads to the following proposal for a g-prior under all models $\mathcal{M}$:

    $$\beta_{\mathcal{M}} \mid g \sim N\left(0, \; g \, \mathcal{J}_n(\hat\beta_{\mathcal{M}})^{-1}\right).$$    (13)

The advantage of (13) is two-fold: geometric interpretability through local orthogonality, which will be illustrated next, and computational efficiency in Bayes factor approximation (see Section 2.4).


Note that we may reparameterize the model (5) as

    $$\eta_{\mathcal{M}} = 1_n \alpha + X^c_{\mathcal{M}} \beta_{\mathcal{M}},$$    (14)

where (with apologies for abuse of notation) $\alpha$ is the intercept in the centered parameterization. Under this centered parameterization and with $p(\alpha) \propto 1$, the observed information at the MLEs is block diagonal, and leads to the same marginal likelihood as in (8).

In hypothesis testing, where parameter $\beta$ is tested against a null value $\beta_0$ with a nuisance parameter $\alpha$, Jeffreys (1961) argues that when the Fisher information is block diagonal for all values of $\beta$ and $\alpha$, improper uniform priors on $\alpha$ can be justified. This global orthogonality, however, rarely holds outside of normal models (Cox and Reid 1987). Under a local alternative hypothesis where the true value of $\beta$ is in an $O(n^{-1/2})$ neighborhood of $\beta_0$, Kass and Vaidyanathan (1992) show that Bayes factors are not sensitive to prior choices on the nuisance parameter, under a weaker condition of null orthogonality, where $\mathcal{I}_n(\alpha, \beta_0)$ is block diagonal for all $\alpha$ under the null hypothesis. In particular, under null orthogonality, the logarithm of the Bayes factor under the unit information prior for $\beta$ can be approximated by BIC with an error of $O_p(n^{-1/2})$ (Kass and Wasserman 1995). For GLMs, the g-prior (7) implies null orthogonality under the centered reparameterization from $X_{\mathcal{M}}$ to $(I_n - P_{1_n}) X_{\mathcal{M}}$.

For variable selection, if the true value $\beta^*_{\mathcal{M}}$ does not lie in a neighborhood of the null value, Kass and Vaidyanathan (1992) point out that the Bayes factor will likely be decisive and for practical purposes the accuracy of BIC does not matter. However, for estimation, local orthogonality at the MLE, as in the g-prior in (13), better captures the large sample geometry of the likelihood parameters $(\alpha, \beta_{\mathcal{M}})$ than null orthogonality and, as we will see, greatly simplifies posterior derivations and theoretical calculations. Under the null model, local orthogonality implies null orthogonality asymptotically.

Note that local or null orthogonalization is not required for $\alpha$ to have a uniform prior as in Bayarri et al. (2012), but instead the uniform prior leads to the use of the centered $X^c_{\mathcal{M}}$ that is locally orthogonal to the column of ones under the information inner product and invariant under any location changes for the columns of $X$. For ease of exposition, however, we will adopt the centered parameterization in (14) for the remainder of the article.


2.3 Posterior Distributions of Parameters 

Under the g-prior (13) on $\beta_{\mathcal{M}}$ and a uniform prior (1) on $\alpha$ for the centered parameterization (14), asymptotic limiting distribution theory (Bernardo and Smith 2000, p. 287) under a Laplace approximation yields the approximate posterior distributions conditional on $\mathcal{M}$ as

    $$\beta_{\mathcal{M}} \mid Y, \mathcal{M}, g \sim N\left( \frac{g}{1+g} \hat\beta_{\mathcal{M}}, \; \frac{g}{1+g} \mathcal{J}_n(\hat\beta_{\mathcal{M}})^{-1} \right),$$    (15)
    $$\alpha \mid Y, \mathcal{M} \sim N\left( \hat\alpha_{\mathcal{M}}, \; \mathcal{J}_n(\hat\alpha_{\mathcal{M}})^{-1} \right),$$    (16)

which depend on $Y$ through functions of MLEs. Due to local orthogonality, the posterior distributions of $\beta_{\mathcal{M}}$ and $\alpha$ are independent. Thus for large $n$, the marginal posterior distribution of $\alpha$ is proper, although its prior distribution is improper.

The conditional posterior mean of $\beta_{\mathcal{M}}$ is shrunk from the MLE $\hat\beta_{\mathcal{M}}$ towards the prior mean 0 by the ratio $g/(1 + g)$, which is usually referred to as the shrinkage factor for g-priors in normal linear regression (Liang et al. 2008). Under a different variant of the g-prior for GLMs (7), the same shrinkage factor $g/(1 + g)$ is obtained by Held et al. (2015), by assuming that $\mathcal{I}_n(\alpha_{\mathcal{M}}, \beta_{\mathcal{M}})$ equals the block diagonal matrix $\mathcal{I}_n(\alpha_{\mathcal{M}}, \beta_{\mathcal{M}} = 0)$, which approximates the expected information when $\beta_{\mathcal{M}}$ is in a neighborhood of zero. As discussed in Copas (1983, 1997), for normal linear regression and GLMs, shrinking predicted values toward the center of responses, or equivalently, shrinking regression coefficients towards zero, may alleviate over-fitting, and thus yield optimal prediction performance. Later in Section 5.2, the GUSTO-I data logistic regression example shows that the methods in favor of smaller values of g, i.e., smaller shrinkage factors, tend to be more accurate in out-of-sample prediction.
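The shrinkage factor can be checked directly in the scalar case. The sketch below combines a normal likelihood kernel with mean `beta_hat` and precision `info` with a zero-mean normal prior of variance `g / info`, which is the one-dimensional analogue of the g-prior (13). The function name and numeric inputs are hypothetical and chosen only for illustration.

```python
def posterior_beta(beta_hat, info, g):
    """Combine the kernel N(beta_hat, 1/info) with the prior N(0, g/info)."""
    prior_prec = info / g            # prior precision under the g-prior
    post_prec = info + prior_prec    # precisions add for normal kernels
    post_mean = info * beta_hat / post_prec
    return post_mean, 1.0 / post_prec

# Illustrative values: MLE 0.8, observed information 25, g = 9.
mean, var = posterior_beta(beta_hat=0.8, info=25.0, g=9.0)
```

With $g = 9$ the shrinkage factor is $0.9$, so the posterior mean is $0.9 \times 0.8 = 0.72$ and the posterior variance is $0.9 / 25 = 0.036$, matching the form of (15).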


2.4 Approximate Bayes Factor 


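In GLMs, normal priors such as (7) and (13) yield closed form marginal likelihoods under Laplace approximations which are precise to $O(n^{-1})$. Under an integrated Laplace approximation (Wang and George 2007) with the uniform prior on $\alpha$ and g-prior in (13) for any model $\mathcal{M}$, the approximate marginal likelihood for $\mathcal{M}$ and $g$ in (6) has a closed form expression

    $$p(Y \mid \mathcal{M}, g) = \int p(Y \mid \beta_{\mathcal{M}}, \mathcal{M}) \, p(\beta_{\mathcal{M}} \mid \mathcal{M}, g) \, d\beta_{\mathcal{M}}$$
    $$\propto p(Y \mid \hat\alpha_{\mathcal{M}}, \hat\beta_{\mathcal{M}}, \mathcal{M}) \, \mathcal{J}_n(\hat\alpha_{\mathcal{M}})^{-1/2} (1 + g)^{-p_{\mathcal{M}}/2} \exp\left\{ -\frac{Q_{\mathcal{M}}}{2(1 + g)} \right\},$$    (17)

where $p_{\mathcal{M}}$ is the column rank of $X_{\mathcal{M}}$, and

    $$Q_{\mathcal{M}} = \hat\beta_{\mathcal{M}}^T \mathcal{J}_n(\hat\beta_{\mathcal{M}}) \hat\beta_{\mathcal{M}}$$    (18)

is the Wald statistic (under observed information). For the null model $\mathcal{M}_0$ where $p_{\mathcal{M}_0} = 0$, $Q_{\mathcal{M}_0} = 0$ so that (17) still holds. The approximate marginal likelihood (17) is a function of MLEs, which are fast to compute using existing algorithms such as iteratively reweighted least squares (McCullagh and Nelder 1989).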


To compare a pair of models $\mathcal{M}_1$ and $\mathcal{M}_2$, the Bayes factor (Kass and Raftery 1995), defined as $BF_{\mathcal{M}_1:\mathcal{M}_2} = p(Y \mid \mathcal{M}_1, g)/p(Y \mid \mathcal{M}_2, g)$, is commonly used in Bayesian model selection, assuming the two models are equally likely a priori. If $BF_{\mathcal{M}_1:\mathcal{M}_2}$ is greater (less) than one, then $\mathcal{M}_1$ ($\mathcal{M}_2$) is favored. When $2^p$ models are considered simultaneously, under the uniform prior $\pi(\mathcal{M}) = 2^{-p}$, comparing their posterior probabilities is equivalent to comparing their Bayes factors where each model is compared to a common baseline model, such as the null model (Liang et al. 2008). With the availability of closed form approximate marginal likelihoods (17), the g-prior (13) yields closed form Bayes factors

    $$BF_{\mathcal{M}:\mathcal{M}_0} = \frac{p(Y \mid \mathcal{M}, g)}{p(Y \mid \mathcal{M}_0)} = \exp\left\{ \frac{z_{\mathcal{M}}}{2} \right\} \left[ \frac{\mathcal{J}_n(\hat\alpha_{\mathcal{M}})}{\mathcal{J}_n(\hat\alpha_{\mathcal{M}_0})} \right]^{-1/2} (1 + g)^{-p_{\mathcal{M}}/2} \exp\left\{ -\frac{Q_{\mathcal{M}}}{2(1 + g)} \right\},$$    (19)

where

    $$z_{\mathcal{M}} = 2 \log \frac{p(Y \mid \hat\alpha_{\mathcal{M}}, \hat\beta_{\mathcal{M}}, \mathcal{M})}{p(Y \mid \hat\alpha_{\mathcal{M}_0}, \mathcal{M}_0)}$$    (20)

is the change in deviance, that is, twice the log-likelihood ratio for comparing model $\mathcal{M}$ to $\mathcal{M}_0$. For simplicity, $z_{\mathcal{M}}$ will be referred to as the deviance statistic for the rest of this article. The Bayes factors under the g-prior provide an adjustment to the likelihood ratio test with a penalty that depends on $g$ and the Wald statistic.
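To make the structure of this Bayes factor concrete, the following Python sketch evaluates its logarithm, assuming the form $\log BF_{\mathcal{M}:\mathcal{M}_0} = z_{\mathcal{M}}/2 - \frac{1}{2}\log[\mathcal{J}_n(\hat\alpha_{\mathcal{M}})/\mathcal{J}_n(\hat\alpha_{\mathcal{M}_0})] - \frac{p_{\mathcal{M}}}{2}\log(1+g) - Q_{\mathcal{M}}/(2(1+g))$ as reconstructed from the surrounding definitions. The function and argument names are hypothetical; this is not the BAS implementation.

```python
import math

def log_bf_vs_null(z, q, p, g, j_alpha_m=1.0, j_alpha_0=1.0):
    """Log Bayes factor of model M against the null model.

    z: deviance statistic, q: Wald statistic, p: number of predictors,
    g: prior scale, j_alpha_m / j_alpha_0: observed information of the
    intercept under M and under the null (assumed inputs).
    """
    return (0.5 * z
            - 0.5 * math.log(j_alpha_m / j_alpha_0)
            - 0.5 * p * math.log1p(g)
            - q / (2.0 * (1.0 + g)))

# Illustrative call with made-up statistics.
example = log_bf_vs_null(z=10.0, q=9.0, p=2, g=99.0)
```

Working on the log scale keeps the computation stable when the deviance statistic is large, and the null model (with `z = q = p = 0`) gives a log Bayes factor of zero as expected.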


The expression for the Bayes factor in (19) is closely related to the test-based Bayes factors (TBF) of Hu and Johnson (2009); Held et al. (2015, 2016), which are derived from the asymptotic distributions of $z_{\mathcal{M}}$ under $\mathcal{M}$ and $\mathcal{M}_0$. For GLMs, the TBF of Held et al. (2015) is expressed as

    $$TBF_{\mathcal{M}:\mathcal{M}_0} = \frac{G\left(z_{\mathcal{M}}; \frac{p_{\mathcal{M}}}{2}, \frac{1}{2(1+g)}\right)}{G\left(z_{\mathcal{M}}; \frac{p_{\mathcal{M}}}{2}, \frac{1}{2}\right)} = (1 + g)^{-p_{\mathcal{M}}/2} \exp\left\{ \frac{g \, z_{\mathcal{M}}}{2(1 + g)} \right\},$$    (21)

where $G(z; a, b)$ denotes the density of a Gamma distribution of mean $a/b$, evaluated at $z$. Under the null or a local alternative where $\beta_{\mathcal{M}}$ is in an $O(n^{-1/2})$ neighborhood of the null, the Wald statistic $Q_{\mathcal{M}}$ and deviance statistic $z_{\mathcal{M}}$ are asymptotically equivalent. In this case, replacing $Q_{\mathcal{M}}$ by $z_{\mathcal{M}}$ in (19) leads to the expression for the TBF. When the distance between $\beta_{\mathcal{M}}$ and the null does not vanish with $n$, we find that the TBF exhibits a small but systematic bias, but leads to little difference in inference for large $g = n$, where they are close to BIC. In Section 5, using simulation and real examples, we find that with $g = n$, TBF and DBF have almost identical performance in model selection, estimation, and prediction. More discussion and an empirical example with TBF are available in the supplementary material appendix.
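The closed form of the TBF can be verified numerically from its definition as a ratio of Gamma densities with shape $p_{\mathcal{M}}/2$ and rates $1/(2(1+g))$ and $1/2$. The sketch below uses only the standard library; the function names and test values are hypothetical.

```python
import math

def log_gamma_pdf(z, shape, rate):
    """Log density of a Gamma(shape, rate) distribution (mean shape / rate)."""
    return (shape * math.log(rate) + (shape - 1.0) * math.log(z)
            - rate * z - math.lgamma(shape))

def log_tbf_ratio(z, p, g):
    """Log TBF computed as the ratio of the two Gamma densities."""
    return (log_gamma_pdf(z, p / 2.0, 1.0 / (2.0 * (1.0 + g)))
            - log_gamma_pdf(z, p / 2.0, 0.5))

def log_tbf_closed(z, p, g):
    """Log of (1 + g)^(-p/2) * exp(g * z / (2 * (1 + g)))."""
    return -0.5 * p * math.log1p(g) + g * z / (2.0 * (1.0 + g))

# Illustrative deviance statistic, model size, and g.
diff = log_tbf_ratio(12.5, 3, 50.0) - log_tbf_closed(12.5, 3, 50.0)
```

The two expressions agree up to floating-point error because the $z^{a-1}$ and $\Gamma(a)$ terms cancel in the ratio, leaving only the rate terms.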


2.5 When MLEs Do Not Exist 

Before turning to the choice of g and other properties, we briefly investigate the use of g-priors (13) when MLEs of $\alpha_{\mathcal{M}}$ or $\beta_{\mathcal{M}}$ do not exist. Two different cases are considered: data separation in binary regression, and non-full rank design matrices for GLMs in general.

For binary regression models with a finite sample size, data separation problems may cause serious issues (Albert and Anderson 1984; Heinze and Schemper 2002; Ghosh et al. 2015). For $X_{\mathcal{M}}$ of full rank, the data exhibit separation if there exists a scalar $\gamma_0 \in \mathbb{R}$ and a non-null vector $\gamma = (\gamma_1, \ldots, \gamma_{p_{\mathcal{M}}})^T \in \mathbb{R}^{p_{\mathcal{M}}}$ such that

    $$\gamma_0 + x_{\mathcal{M},i}^T \gamma \ge 0 \ \text{if } Y_i = 1, \qquad \gamma_0 + x_{\mathcal{M},i}^T \gamma \le 0 \ \text{if } Y_i = 0, \qquad \text{for all } i = 1, \ldots, n.$$    (22)

In particular, there is complete separation if in (22) strict inequalities hold for all observations. In the absence of complete separation, there is quasi-complete separation if (22) holds with at least one quasi-separated sample for which the equality holds. By Albert and Anderson (1984), in the presence of quasi-complete separation, there exists a non-empty set of observations $Q \subset \{1, \ldots, n\}$ that can only be quasi-separated by all $(\gamma_0, \gamma)$ pairs that satisfy (22). For the design matrix $X_{\mathcal{M},Q}$ formed by these observations, its rank $q_{\mathcal{M}} = \mathrm{rank}(X_{\mathcal{M},Q}) < p_{\mathcal{M}}$, because $X_{\mathcal{M}}$ is full rank and columns of $(1, X_{\mathcal{M},Q})$ are linearly dependent.

If there is complete or quasi-complete separation, then the MLEs of $\alpha_{\mathcal{M}}$ and $\beta_{\mathcal{M}}$ do not exist, i.e., they tend to $\pm$ infinity (Albert and Anderson 1984), and MLEs of probabilities are on the boundary of the parameter space in binary regression.
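The divergence of the MLE under complete separation is easy to see numerically. In the hypothetical one-predictor logistic regression below, the responses are perfectly separated by the sign of the predictor, so the log likelihood keeps increasing toward zero as the slope grows and no finite maximizer exists. The data and function name are illustrative assumptions.

```python
import math

def log_lik(alpha, beta, x, y):
    """Bernoulli log likelihood under the logit link."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = alpha + beta * xi
        # log p = -log(1 + exp(-eta)); log(1 - p) = -log(1 + exp(eta))
        if yi == 1:
            total += -math.log1p(math.exp(-eta))
        else:
            total += -math.log1p(math.exp(eta))
    return total

# Completely separated toy data: y = 1 exactly when x > 0.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

# Log likelihood along an increasing sequence of slopes (intercept fixed at 0).
path = [log_lik(0.0, b, x, y) for b in (1.0, 5.0, 25.0)]
```

Along this path the fitted probabilities approach 0 or 1, so the weights $\hat p_i (1 - \hat p_i)$ that enter the observed information shrink toward zero.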

The following proposition summarizes results for Bayes factors for the two most commonly used binary models, logistic and probit regressions.

Proposition 2. For both logistic and probit regression models, under model $\mathcal{M}$,

(1) If there is complete separation, then the observed information $\mathcal{J}_n(\hat\eta_{\mathcal{M}})$ has diagonal elements that are all zero, the g-prior in (13) is not proper and the Laplace approximation is no longer valid for approximating the Bayes factor.

(2) If there is quasi-complete separation, the rank of the precision matrix of (13) is $q_{\mathcal{M}}$, i.e., the g-prior has a singular precision matrix unless $q_{\mathcal{M}} = p_{\mathcal{M}}$, and the Bayes factor formula (19) is bounded.


The proof is available in supplementary material Appendix A.4. Under complete separation, the g-prior in (13) violates the "Basic Criterion" in Section 4.1. While the g-prior (7), which depends on the covariance structure under the null, is well defined in the presence of data separation and leads to bounded Bayes factors as the expected information under $\mathcal{M}$ is not used in the Laplace approximation, its posterior estimates of probabilities inherit the instability of the MLEs.

Design matrices that are not full rank also lead to identifiability problems with MLEs of $\alpha_{\mathcal{M}}$ and $\beta_{\mathcal{M}}$ for all GLMs. Consider a model $\mathcal{M}$ where $\mathrm{rank}(X_{\mathcal{M}}) = r_{\mathcal{M}} < p_{\mathcal{M}}$, and a full rank design matrix $X_{\mathcal{M}'}$ that contains $r_{\mathcal{M}}$ columns and spans the same column space as $X_{\mathcal{M}}$, i.e., $C(X_{\mathcal{M}}) = C(X_{\mathcal{M}'})$. Although the MLEs of the coefficients are not all unique, MLEs of the linear predictors $\hat\eta_{\mathcal{M},i}$ are unique; in fact,

    $$\hat\eta_{\mathcal{M}} = 1_n \hat\alpha_{\mathcal{M}} + X_{\mathcal{M}} \hat\beta_{\mathcal{M}} = 1_n \hat\alpha_{\mathcal{M}'} + X_{\mathcal{M}'} \hat\beta_{\mathcal{M}'},$$    (23)

and $\mathcal{J}_n(\hat\eta_{\mathcal{M}})$ is unique and positive definite. The precision matrix of the g-prior (13), $\mathcal{J}_n(\hat\beta_{\mathcal{M}}) = (X^c_{\mathcal{M}})^T \mathcal{J}_n(\hat\eta_{\mathcal{M}}) X^c_{\mathcal{M}}$, is well defined; however, since $\mathrm{rank}(X^c_{\mathcal{M}}) = \mathrm{rank}(X_{\mathcal{M}}) = r_{\mathcal{M}} < p_{\mathcal{M}}$, its inverse does not exist due to singularity. Note that the null-based g-prior (7) suffers from a similar singularity problem.

We may extend the definition of g-priors to include singular covariance matrices by adopting generalized inverses in defining the g-prior. Because of the invariance of orthogonal projections to choices of generalized inverse and uniqueness of the MLE of $\eta_{\mathcal{M}}$, we have the following proposition regarding the Bayes factors in models that are not full rank.


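Proposition 3. Suppose $\mathrm{rank}(X_{\mathcal{M}}) = r_{\mathcal{M}} < p_{\mathcal{M}}$, then

    $$BF_{\mathcal{M}:\mathcal{M}_0} = \frac{p(Y \mid \mathcal{M}, g)}{p(Y \mid \mathcal{M}_0)} = \exp\left\{ \frac{z_{\mathcal{M}}}{2} \right\} \left[ \frac{\mathcal{J}_n(\hat\alpha_{\mathcal{M}})}{\mathcal{J}_n(\hat\alpha_{\mathcal{M}_0})} \right]^{-1/2} (1 + g)^{-r_{\mathcal{M}}/2} \exp\left\{ -\frac{Q_{\mathcal{M}}}{2(1 + g)} \right\}.$$    (24)

If $\mathcal{M}'$ is a full rank model whose column space $C(X_{\mathcal{M}'}) = C(X_{\mathcal{M}})$, then $Q_{\mathcal{M}} = Q_{\mathcal{M}'}$, $z_{\mathcal{M}} = z_{\mathcal{M}'}$, and $BF_{\mathcal{M}:\mathcal{M}'} = 1$.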


The proof is available in supplementary material Appendix A.5. Here the two models
$\mathcal{M}$ and $\mathcal{M}'$ have the same Bayes factor if their design matrices span the same column space.
This form of invariance is not possible with other conventional independent prior distributions,
such as generalized ridge regression or independent scale mixtures of normals. While posterior
means of coefficients under BMA will not be well defined, predictive quantities under model
selection or model averaging will be stable; however, care must be taken in assigning prior
probabilities over equivalent models.

2.6 Choice of g


Problems with fixed values of g prompted Liang et al. (2008) to study data-dependent or
adaptive values for g. This includes the unit information prior where g = n (Kass and
Wasserman 1995), and local and global empirical Bayes (EB) estimates of g (Copas 1983,
1997; Hansen and Yu 2001, 2003; Liang et al. 2008; Held et al. 2015).


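For the local EB, each model $\mathcal{M}$ has its own optimal value of g that maximizes its marginal
likelihood:

$$\hat g^{\mathrm{LEB}}_\mathcal{M} = \arg\max_{g > 0}\; p(Y \mid \mathcal{M}, g),$$

and the local EB estimator of the marginal likelihood is obtained by simply plugging in the
estimator: $\hat p^{\mathrm{LEB}}(Y \mid \mathcal{M}) = p(Y \mid \mathcal{M}, \hat g^{\mathrm{LEB}}_\mathcal{M})$.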

For example, under the g-prior (13), Hansen and Yu (2003) derive

$$\hat g^{\mathrm{LEB}}_\mathcal{M} = \max\left(\frac{Q_\mathcal{M}}{p_\mathcal{M}} - 1,\; 0\right),$$

which has a similar format to $\hat g^{\mathrm{LEB}}_\mathcal{M} = \max(z_\mathcal{M}/p_\mathcal{M} - 1,\, 0)$, its counterpart
for the test-based marginal likelihood under the null-based g-prior, derived by Held et al. (2015).
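As a minimal sketch of the local EB plug-in (the function names and the numbers are illustrative assumptions, not part of BAS), the snippet below evaluates the estimate and checks on a grid that it maximizes the g-dependent part of the log marginal likelihood, $-(p_\mathcal{M}/2)\log(1+g) - Q_\mathcal{M}/\{2(1+g)\}$, which is the form implied by the fixed-g Bayes factor.

```python
import math


def local_eb_g(Q, p):
    """Local EB estimate of g for Wald statistic Q and model size p."""
    return max(Q / p - 1.0, 0.0)


def log_g_kernel(g, Q, p):
    """g-dependent part of the log marginal likelihood under the g-prior."""
    return -0.5 * p * math.log1p(g) - Q / (2.0 * (1.0 + g))


# Hypothetical inputs: a Wald statistic of 10 for a model with 4 predictors.
Q, p = 10.0, 4
g_hat = local_eb_g(Q, p)

# Grid check: no other g on the grid gives a larger kernel value.
grid = [0.01 * k for k in range(1, 2001)]
best_on_grid = max(grid, key=lambda g: log_g_kernel(g, Q, p))
```

When $Q_\mathcal{M} \le p_\mathcal{M}$ the estimate is truncated at zero, so the plug-in marginal likelihood coincides with that of the null model.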

The global EB involves only a single estimator of g, based on the marginal likelihood
averaged over all models, $\hat g^{\mathrm{GEB}} = \arg\max_{g > 0} \sum_\mathcal{M} p(\mathcal{M})\, p(Y \mid \mathcal{M}, g)$.
The global EB estimator may be obtained via an EM algorithm when all models may be enumerated
(Liang et al. 2008), but is more difficult to compute for larger problems (Held et al. 2015).
For the remainder of the article, we will restrict attention to the local EB approach.

The EB estimates of g do not lead to consistent model selection under the null model
(Liang et al. 2008), although they provide consistent estimation. Mixtures of g-priors provide
an alternative that propagates uncertainty in g, with other desirable properties.

3 Mixtures of g-Priors


Liang et al. (2008) highlight some of the problems with using a fixed value of g for model
selection or BMA and recommend mixtures of g-priors that lead to closed form expressions
or tractable approximations. In order to consider the model selection criteria of Bayarri
et al. (2012), we propose an extremely flexible family of mixtures of g-priors that can encompass
the majority of the existing mixtures of g-priors as special cases. Furthermore, utilizing
Laplace approximations, it yields marginal likelihoods and (data-based) Bayes factors in closed
form, for both GLMs and extensions such as normal linear regressions with unknown variances
and over-dispersed GLMs. This tractability permits establishing properties such as consistency.


3.1 Compound Confluent Hypergeometric Distributions 

The parameter g enters into the posterior distribution for $\beta_\mathcal{M}$ and the marginal likelihood
(17) through the shrinkage factor $g/(1+g)$ or the complementary shrinkage factor
$u = 1/(1+g)$. Since the approximate marginal likelihood depends on g in the format of u,
$p(Y \mid \mathcal{M}, u) \propto u^{p_\mathcal{M}/2} \exp(-uQ_\mathcal{M}/2)$, a conjugate prior for u (given $\phi_0$) should
contain the kernel of a truncated Gamma density with the support $u \in [0, 1]$. Beta distributions
are also a natural prior choice for u, such as the hyper-g prior of Liang et al. (2008). Other
mixtures of g-priors such as the robust prior (Bayarri et al. 2012) and the intrinsic prior
(Womack et al. 2014) truncate the support of g away from zero, so the resulting u has an upper
bound strictly smaller than one.

To incorporate the above choices in one unified family, we adopt a generalized Beta distribution
introduced by Gordy (1998b) called the Compound Confluent Hypergeometric distribution,
whose density function contains both Gamma and Beta kernels, and allows truncation
on the support through a straightforward extension. We say that u has a truncated Compound
Confluent Hypergeometric distribution if $u \sim \mathrm{tCCH}(t, q, r, s, v, \kappa)$ with density expressed as

$$p(u \mid t, q, r, s, v, \kappa) = \frac{v^t \exp(s/v)}{B(t, q)\, \Phi_1(q, r, t+q, s/v, 1-\kappa)} \, \frac{u^{t-1} (1 - vu)^{q-1} e^{-su}}{[\kappa + (1-\kappa) v u]^{r}} \, \mathbf{1}_{\{0 < u < 1/v\}}, \tag{25}$$

where parameters $t > 0$, $q > 0$, $r \in \mathbb{R}$, $s \in \mathbb{R}$, $v \ge 1$, $\kappa > 0$. Here,
$\Phi_1(\alpha, \beta, \gamma, x, y) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} \frac{(\alpha)_{m+n} (\beta)_n}{(\gamma)_{m+n}\, m!\, n!}\, x^m y^n$
is the confluent hypergeometric function of two variables or Humbert series (Humbert 1920), and
$(\alpha)_n$ is the Pochhammer coefficient or shifted factorial: $(\alpha)_n = 1$ if $n = 0$ and
$(\alpha)_n = \Gamma(\alpha + n)/\Gamma(\alpha)$ for $n \in \mathbb{N}$. Note that the parameter v
controls the support of u. When $v = 1$, the support is [0, 1]. When $v > 1$, the upper bound of
the support is strictly less than one, which may accommodate priors with truncated g. This
leads to conjugate updating of u as follows:


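Proposition 4. Let $u = 1/(1+g)$ have the prior distribution

$$u \sim \mathrm{tCCH}\left(\frac{a}{2}, \frac{b}{2}, r, \frac{s}{2}, v, \kappa\right), \tag{26}$$

where $a, b, \kappa > 0$, $r, s \in \mathbb{R}$ and $v \ge 1$, then for GLMs with a fixed dispersion $\phi_0$, integrating
the marginal likelihood in (17) with respect to the prior on u yields the marginal likelihood for
$\mathcal{M}$ which is proportional to

$$p(Y \mid \mathcal{M}) \propto p(Y \mid \hat\alpha_\mathcal{M}, \hat\beta_\mathcal{M}, \mathcal{M})\, \mathcal{J}_n(\hat\alpha_\mathcal{M})^{-1/2}\, v^{-p_\mathcal{M}/2} \exp\left(-\frac{Q_\mathcal{M}}{2v}\right) \frac{B\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a + b + p_\mathcal{M}}{2}, \frac{s + Q_\mathcal{M}}{2v}, 1 - \kappa\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a+b}{2}, \frac{s}{2v}, 1 - \kappa\right)}, \tag{27}$$

where $p_\mathcal{M}$ is the rank of $X_\mathcal{M}$ and $Q_\mathcal{M}$ is given in (18). The posterior distribution of u under
model $\mathcal{M}$ is also a tCCH distribution

$$u \mid Y, \mathcal{M} \sim \mathrm{tCCH}\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}, r, \frac{s + Q_\mathcal{M}}{2}, v, \kappa\right), \tag{28}$$

allowing conjugate updating under integrated Laplace approximations.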


The proof is available in supplementary material Appendix A.6.
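The conjugate update can be checked numerically. The sketch below is an illustration under stated assumptions, not the BAS implementation: it takes the tCCH kernel $u^{t-1}(1-vu)^{q-1}e^{-su}/[\kappa+(1-\kappa)vu]^{r}$ on $(0, 1/v)$, evaluates its normalizing integral through a truncated Humbert series, compares that with Simpson quadrature, and confirms that multiplying the prior kernel by $u^{p_\mathcal{M}/2}\exp(-uQ_\mathcal{M}/2)$ gives the kernel with updated parameters. All function names and parameter values are hypothetical, and the truncated series assumes positive series parameters, $x \ge 0$ and $0 \le y < 1$.

```python
import math


def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def phi1(alpha, beta, gamma, x, y, terms=60):
    """Truncated Humbert series; assumes alpha, beta, gamma > 0, x >= 0, 0 <= y < 1."""
    total = 0.0
    for m in range(terms):
        for n in range(terms):
            log_coef = (math.lgamma(alpha + m + n) - math.lgamma(alpha)
                        + math.lgamma(beta + n) - math.lgamma(beta)
                        - math.lgamma(gamma + m + n) + math.lgamma(gamma)
                        - math.lgamma(m + 1) - math.lgamma(n + 1))
            total += math.exp(log_coef) * x ** m * y ** n
    return total


def tcch_kernel(u, t, q, r, s, v, kappa):
    """Unnormalized tCCH density, zero outside (0, 1/v)."""
    if not 0.0 < u < 1.0 / v:
        return 0.0
    return (u ** (t - 1.0) * (1.0 - v * u) ** (q - 1.0) * math.exp(-s * u)
            / (kappa + (1.0 - kappa) * v * u) ** r)


def tcch_norm_const(t, q, r, s, v, kappa):
    """Integral of the kernel over (0, 1/v), written with the Phi_1 series."""
    return (math.exp(log_beta(t, q) - s / v)
            * phi1(q, r, t + q, s / v, 1.0 - kappa) / v ** t)


def tcch_posterior(a, b, r, s, v, kappa, p, Q):
    """Posterior tCCH parameters: only the first and fourth arguments change."""
    return ((a + p) / 2.0, b / 2.0, r, (s + Q) / 2.0, v, kappa)


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


# Hypothetical hyper parameters a = 4, b = 4, r = 1.5, s = 6, v = 2, kappa = 0.5.
prior = (2.0, 2.0, 1.5, 3.0, 2.0, 0.5)          # (a/2, b/2, r, s/2, v, kappa)
numeric = simpson(lambda u: tcch_kernel(u, *prior), 0.0, 1.0 / prior[4])
series = tcch_norm_const(*prior)
post = tcch_posterior(4.0, 4.0, 1.5, 6.0, 2.0, 0.5, p=2, Q=5.0)
```

The parameters were chosen with $t, q \ge 2$ so that the kernel vanishes at both ends of the support and plain Simpson quadrature is adequate; priors with $a < 2$ have an integrable singularity at zero and would need a different quadrature rule.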


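Corollary 1. The Bayes factor for comparing $\mathcal{M}$ to $\mathcal{M}_\varnothing$ is

$$\mathrm{BF}_{\mathcal{M}:\mathcal{M}_\varnothing} = \left[\frac{\mathcal{J}_n(\hat\alpha_\mathcal{M})}{\mathcal{J}_n(\hat\alpha_{\mathcal{M}_\varnothing})}\right]^{-1/2} \exp\left(\frac{z_\mathcal{M}}{2}\right) v^{-p_\mathcal{M}/2} \exp\left(-\frac{Q_\mathcal{M}}{2v}\right) \frac{B\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a + b + p_\mathcal{M}}{2}, \frac{s + Q_\mathcal{M}}{2v}, 1 - \kappa\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a+b}{2}, \frac{s}{2v}, 1 - \kappa\right)}$$

and depends on the data through the deviance $z_\mathcal{M}$ and the Wald statistic $Q_\mathcal{M}$.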

We refer to the model selection criterion based on the Bayes factor above as the “Confluent 
Hypergeometric Information Criterion” or CHIC as it involves the confluent hypergeometric 
function in two variables and the gf-prior is derived using the information matrix; the hierar¬ 
chical prior formed by 0, 0 and ([2^ will be denoted as the CHIC ^f-prior. 


In the conjugate updating scheme (28), the parameters a and s are updated by the model
rank $p_\mathcal{M}$ and the Wald statistic $Q_\mathcal{M}$, respectively, while none of the remaining four parameters
are updated by the data. The parameters a/2 and b/2 play a role similar to the shape
parameters in Beta distributions, where small a or large b tends to put more prior weight on
small values of u, or equivalently, large values of g. We will show later that a also controls
the tail behavior of the marginal prior on $\beta_\mathcal{M}$. The parameter v controls the support, while
parameters r, s, and $\kappa$ “squeeze” the prior density to the left or right (Gordy 1998b). In particular,
large s skews the prior distribution of u towards the left side, in turn favoring large g.
Table 1 lists special cases of the CHIC g-prior and corresponding hyper parameters that have
appeared in the literature. The last column indicates whether model selection consistency
holds for all models, which will be presented in Section 4.3. We provide more details about
these special cases in the next sections.
Table 1: Special cases of the CHIC g-prior with hyper parameters and whether the prior
distributions lead to consistency for model selection under all models. If no, the models
where consistency fails are indicated.

Prior        a      b                 r     s      v                    κ              Consistency
CH           a      b                 0     s      1                    1              If b = O(n) or s = O(n)
Hyper-g      1      2                 0     0      1                    1              No, M_∅
Uniform      2      2                 0     0      1                    1              No, M_∅
Jeffreys     0      2                 0     0      1                    1              No, M_∅
Beta-prime   1/2    n - p_M - 1.5     0     0      1                    1              Yes
Benchmark    0.02   0.02 max(n, p^2)  0     0      1                    1              Yes
ZS adapted   1      2                 0     n + 3  1                    1              Yes
Robust       1      2                 1.5   0      (n+1)/(p_M+1)        1              Yes
Hyper-g/n    1      2                 1.5   0      1                    1/n            Yes
Intrinsic    1      1                 1     0      (n+p_M+1)/(p_M+1)    (n+p_M+1)/n    Yes
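For readers who want Table 1 in machine-readable form, here is a minimal sketch. The function name, the prior labels, and the argument names are assumptions made for illustration; the intrinsic-prior entries for v and κ follow from the stated upper bound $(p_\mathcal{M}+1)/(n+p_\mathcal{M}+1)$ on u and should be checked against the published table before use.

```python
def chic_hyperparameters(prior, n, p_model, p_total):
    """Return (a, b, r, s, v, kappa) for a named special case of the CHIC g-prior.

    n is the sample size, p_model the size of the model at hand, and
    p_total the total number of candidate predictors.
    """
    table = {
        "hyper-g":    (1.0, 2.0, 0.0, 0.0, 1.0, 1.0),
        "uniform":    (2.0, 2.0, 0.0, 0.0, 1.0, 1.0),
        "jeffreys":   (0.0, 2.0, 0.0, 0.0, 1.0, 1.0),
        "beta-prime": (0.5, n - p_model - 1.5, 0.0, 0.0, 1.0, 1.0),
        "benchmark":  (0.02, 0.02 * max(n, p_total ** 2), 0.0, 0.0, 1.0, 1.0),
        "zs-adapted": (1.0, 2.0, 0.0, n + 3.0, 1.0, 1.0),
        "robust":     (1.0, 2.0, 1.5, 0.0, (n + 1.0) / (p_model + 1.0), 1.0),
        "hyper-g/n":  (1.0, 2.0, 1.5, 0.0, 1.0, 1.0 / n),
        "intrinsic":  (1.0, 1.0, 1.0, 0.0,
                       (n + p_model + 1.0) / (p_model + 1.0),
                       (n + p_model + 1.0) / n),
    }
    return table[prior]


def support_upper_bound(params):
    """Upper end of the support of u = 1/(1+g), which equals 1/v."""
    return 1.0 / params[4]


robust = chic_hyperparameters("robust", n=500, p_model=5, p_total=20)
intrinsic = chic_hyperparameters("intrinsic", n=500, p_model=5, p_total=20)
benchmark = chic_hyperparameters("benchmark", n=500, p_model=5, p_total=20)
```

With n = 500 and a five-predictor model, the robust and intrinsic priors restrict u to roughly (0, 0.012), that is, g is bounded below by a quantity of order n.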


3.2 Special Cases 


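Confluent Hypergeometric (CH) prior. The Confluent Hypergeometric distribution, proposed
by Gordy (1998a), is a special case of the CHIC family and is a generalized Beta
distribution with density

$$p(u \mid t, q, s) = \frac{u^{t-1} (1-u)^{q-1} \exp(-su)}{B(t, q)\; {}_1F_1(t, t+q, -s)}\, \mathbf{1}_{\{0 < u < 1\}},$$

where $t > 0$, $q > 0$, $s \in \mathbb{R}$, and
${}_1F_1(a, b, s) = \frac{\Gamma(b)}{\Gamma(b-a)\Gamma(a)} \int_0^1 z^{a-1} (1-z)^{b-a-1} \exp(sz)\, dz$ is the
Confluent Hypergeometric function (Abramowitz and Stegun 1970). Based on this distribution,
we propose the CH prior by letting u have the following hyper prior

$$u \sim \mathrm{CH}\left(\frac{a}{2}, \frac{b}{2}, \frac{s}{2}\right), \tag{29}$$

under which the posterior for u is again in the same family, and $p(Y \mid \mathcal{M})$ has a closed form

$$u \mid Y, \mathcal{M} \sim \mathrm{CH}\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}, \frac{s + Q_\mathcal{M}}{2}\right), \tag{30}$$

$$p(Y \mid \mathcal{M}) \propto p(Y \mid \hat\alpha_\mathcal{M}, \hat\beta_\mathcal{M}, \mathcal{M})\, \mathcal{J}_n(\hat\alpha_\mathcal{M})^{-1/2}\, \frac{B\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a + p_\mathcal{M}}{2}, \frac{a + b + p_\mathcal{M}}{2}, -\frac{s + Q_\mathcal{M}}{2}\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a}{2}, \frac{a+b}{2}, -\frac{s}{2}\right)},$$

under the integrated Laplace approximation.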
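A rough sketch of how the model-dependent ratio of Beta and ${}_1F_1$ terms under the CH prior could be evaluated with the standard library alone is given below. It assumes that the ratio equals the prior expectation of $u^{p_\mathcal{M}/2}\exp(-uQ_\mathcal{M}/2)$, uses the power series for ${}_1F_1$ together with Kummer's transformation ${}_1F_1(a, b, -x) = e^{-x}\,{}_1F_1(b-a, b, x)$ to avoid cancellation at negative arguments, and is not the routine used in BAS. Function names and inputs are hypothetical.

```python
import math


def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def hyp1f1(a, b, x, max_terms=2000):
    """Kummer function 1F1(a, b, x) by its power series (moderate |x| only)."""
    if x < 0:
        # Kummer's transformation keeps every term of the series positive.
        return math.exp(x) * hyp1f1(b - a, b, -x, max_terms)
    total, term = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) / (b + k) * x / (k + 1)
        total += term
        if abs(term) < 1e-16 * abs(total):
            break
    return total


def log_ch_factor(a, b, s, p, Q):
    """log of B(.)1F1(.) / B(.)1F1(.) for a CH(a/2, b/2, s/2) prior on u."""
    num = log_beta((a + p) / 2.0, b / 2.0) + math.log(
        hyp1f1((a + p) / 2.0, (a + b + p) / 2.0, -(s + Q) / 2.0))
    den = log_beta(a / 2.0, b / 2.0) + math.log(
        hyp1f1(a / 2.0, (a + b) / 2.0, -s / 2.0))
    return num - den


# Uniform prior on u (a = b = 2, s = 0), a two-predictor model, and Q = 4.
factor = math.exp(log_ch_factor(2.0, 2.0, 0.0, p=2, Q=4.0))
```

For this uniform-prior case the factor is the integral of $u e^{-2u}$ over (0, 1), which has the elementary value $1/4 - (3/4)e^{-2}$ and gives an independent check on the series.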

Similar to the CHIC g-prior, small a, large b, or large s favors small u a priori, with a
controlling the tail behavior. In model selection, preference for heavy-tailed prior distributions
can be traced back to Jeffreys (1961), who suggested a Cauchy prior for the normal location
parameter to resolve the information paradox in the simple normal means case. The following
result shows that the CH prior has multivariate Student t tails with degrees of freedom
a, and in particular, the choice a = 1 leads to tail behavior like a multivariate Cauchy.

Proposition 5. Under the CH prior, the marginal prior distribution $p(\beta_\mathcal{M} \mid \mathcal{M})$ has tails
behaving as a multivariate Student distribution with degrees of freedom a, i.e.,

$$\lim_{\|\beta_\mathcal{M}\| \to \infty} p(\beta_\mathcal{M} \mid \mathcal{M}) \propto \left(\|\beta_\mathcal{M}\|^2_{\mathcal{J}_n}\right)^{-\frac{a + p_\mathcal{M}}{2}},$$

where $\|\beta_\mathcal{M}\| = (\beta_\mathcal{M}^T \beta_\mathcal{M})^{1/2}$ and
$\|\beta_\mathcal{M}\|_{\mathcal{J}_n} = [\beta_\mathcal{M}^T \mathcal{J}_n(\hat\beta_\mathcal{M}) \beta_\mathcal{M}]^{1/2}$.

A proof is available in supplementary materials Appendix A.7. While the CH prior has only
half of the number of parameters as the CHIC g-prior, it remains a flexible class of priors for
$u \in [0, 1]$. In particular, when s = 0, (29) reduces to a Beta distribution, and when b = 2, it
reduces to a truncated Gamma distribution. For the CH prior, we let parameter a be fixed,
and parameters b and s be either fixed, or on the order of O(n). The CH prior, and thus
the CHIC g-prior, encompass several existing mixtures of g-priors as follows:
Truncated Gamma prior (Wang and George 2007; Held et al. 2015)

$$u \sim \mathrm{TG}_{(0,1)}(a_t, s_t) \iff p(u) = \frac{s_t^{a_t}}{\gamma(a_t, s_t)}\, u^{a_t - 1} e^{-s_t u}\, \mathbf{1}_{\{0 < u < 1\}}, \tag{31}$$

with parameters $a_t, s_t > 0$ and support [0, 1]. Here $\gamma(a, s) = \int_0^s t^{a-1} e^{-t}\, dt$ is the
incomplete Gamma function. This is equivalent to assigning an incomplete inverse-Gamma prior
to g. The truncated Gamma prior permits conjugate updating in GLMs:
$u \mid Y, \mathcal{M} \sim \mathrm{TG}_{(0,1)}(a_t + p_\mathcal{M}/2,\; s_t + Q_\mathcal{M}/2)$. When $a_t = 1$, $s_t = 0$, (31) reduces to a
uniform prior on u.

Held et al. (2015) introduce the ZS adapted prior by letting $a_t = 1/2$, $s_t = (n+3)/2$,
so that the resulting prior on g matches the prior mode of the Zellner and Siow (1980) prior
$g \sim \mathrm{IG}(1/2, n/2)$.


Hyper-g prior (Liang et al. 2008; Cui and George 2008):

$$u \sim \mathrm{Beta}\left(\frac{a_h}{2} - 1,\; 1\right), \quad \text{where } 2 < a_h \le 4, \tag{32}$$

with default value $a_h = 3$. When $a_h = 4$, (32) reduces to a uniform prior on u. The choice
$a_h = 2$ corresponds to the Jeffreys prior on g, which is an improper prior and will lead to
indeterminate Bayes factors if the null model is included in the space of models. Celeux et al.
(2012) avoid this by excluding the null model from consideration. The hyper-g prior (32)
can also be expressed as a Gamma distribution truncated to the interval [0, 1], and hence
has conjugate updating in GLMs,

$$u \sim \mathrm{TG}_{(0,1)}\left(\frac{a_h}{2} - 1,\; 0\right) \implies u \mid Y, \mathcal{M} \sim \mathrm{TG}_{(0,1)}\left(\frac{p_\mathcal{M} + a_h}{2} - 1,\; \frac{Q_\mathcal{M}}{2}\right). \tag{33}$$


Beta-prime prior (Maruyama and George 2011)

$$u \sim \mathrm{Beta}\left(\frac{1}{4},\; \frac{n - p_\mathcal{M} - 1.5}{2}\right),$$

which is equivalent to a Beta-prime prior on g. The second parameter was carefully chosen for
normal linear models to avoid evaluation of the Hypergeometric ${}_2F_1$ function (Abramowitz
and Stegun 1970, eq. 15.3.1) in marginal likelihoods.


Benchmark prior (Ley and Steel 2012)

$$u \sim \mathrm{Beta}\left(c,\; c \cdot \max(n, p^2)\right),$$

which induces an approximate prior mean $E(g) \approx \max(n, p^2)$ (Fernandez et al. 2001). The
recommended parameter value is c = 0.01.


Robust prior (Bayarri et al. 2012) is a mixture of g-priors with the following hyper prior

$$p_r(u) = a_r\, [\rho_r (b_r + n)]^{a_r}\, \frac{u^{a_r - 1}}{[1 + (b_r - 1) u]^{a_r + 1}}\, \mathbf{1}_{\left\{0 < u < \frac{1}{\rho_r (b_r + n) + (1 - b_r)}\right\}}, \tag{34}$$

where $a_r > 0$, $b_r > 0$ and $\rho_r \ge b_r/(b_r + n)$, and is a special case in the CHIC family. The
upper bound of its support is $1/[\rho_r (b_r + n) + (1 - b_r)] \le 1$. Hence, the robust prior does not
include the CH prior (29) as a special case, and vice versa.


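In normal linear models, the robust prior yields closed form marginal likelihoods, which
contain a rarely used special function, the Appell $F_1$ function (Weisstein 2009). Similarly in
GLMs, evaluation of the special function $\Phi_1$ is required. Based on the various criteria for
model selection priors, default parameters $a_r = 0.5$, $b_r = 1$, and $\rho_r = 1/(1 + p_\mathcal{M})$ are
recommended (Bayarri et al. 2012), under which the prior (34) reduces to a truncated Gamma,
with a conjugate updating:

$$u \sim \mathrm{TG}_{\left(0, \frac{p_\mathcal{M}+1}{n+1}\right)}\left(\frac{1}{2},\; 0\right) \implies u \mid Y, \mathcal{M} \sim \mathrm{TG}_{\left(0, \frac{p_\mathcal{M}+1}{n+1}\right)}\left(\frac{p_\mathcal{M} + 1}{2},\; \frac{Q_\mathcal{M}}{2}\right), \tag{35}$$

and with marginal likelihood proportional to

$$p(Y \mid \mathcal{M}) \propto p(Y \mid \hat\alpha_\mathcal{M}, \hat\beta_\mathcal{M}, \mathcal{M})\, \mathcal{J}_n(\hat\alpha_\mathcal{M})^{-1/2} \left(\frac{n+1}{p_\mathcal{M}+1}\right)^{1/2} \left(\frac{2}{Q_\mathcal{M}}\right)^{\frac{p_\mathcal{M}+1}{2}} \gamma\left(\frac{p_\mathcal{M}+1}{2},\; \frac{Q_\mathcal{M}(p_\mathcal{M}+1)}{2(n+1)}\right). \tag{36}$$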


Comparing (33) and (35) reveals an interesting finding: the robust prior can be viewed as
a truncated hyper-g prior, with an upper bound increasing with $p_\mathcal{M}$ and decreasing with n.
In fact, the robust prior includes the hyper-g prior (32), and also the following hyper-g/n
prior as special cases.


Hyper-g/n prior (Liang et al. 2008):

$$p(g) = \frac{a_h - 2}{2n} \left(1 + \frac{g}{n}\right)^{-a_h/2}, \quad \text{where } 2 < a_h \le 4.$$


Intrinsic prior (Berger and Pericchi 1996; Moreno et al. 1998; Womack et al. 2014) is
another mixture of g-priors that truncates the support of g. It has the hyper prior:

$$g = \frac{n}{p_\mathcal{M} + 1} \cdot \frac{1}{w}, \quad w \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right).$$

Under the intrinsic prior, the parameter g is truncated to have a lower bound $n/(p_\mathcal{M} + 1)$,
which corresponds to an upper bound on u of $(p_\mathcal{M} + 1)/(n + p_\mathcal{M} + 1)$. As shown in Table 1,
the intrinsic prior is also in the CHIC family.
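The truncation induced by the intrinsic prior is easy to see by simulation. The following sketch (the function name, seed, and sample sizes are arbitrary assumptions) draws w from a Beta(1/2, 1/2) distribution, maps it to g and then to u = 1/(1+g), and compares the draws with the stated upper bound on u.

```python
import random


def sample_intrinsic_u(n, p_model, rng):
    """One draw of u = 1/(1+g) with g = n / ((p_model + 1) * w), w ~ Beta(1/2, 1/2)."""
    w = rng.betavariate(0.5, 0.5)
    w = max(w, 1e-300)  # guard against an exact zero from the generator
    g = n / (p_model + 1.0) / w
    return 1.0 / (1.0 + g)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n, p_model = 500, 5
draws = [sample_intrinsic_u(n, p_model, rng) for _ in range(1000)]
upper = (p_model + 1.0) / (n + p_model + 1.0)
```

Every draw of u falls at or below the bound, which is another way of saying that g never drops below $n/(p_\mathcal{M}+1)$.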


3.3 Unknown Dispersion 

For normal linear regressions with unknown variances, special cases of the CHIC g-prior, such
as the hyper-g, Beta-prime, benchmark, and robust priors yield closed form Bayes factors,
although they may require evaluation of special functions such as the Gaussian Hypergeometric
${}_2F_1$ or Appell $F_1$. Liang et al. (2008) show that under the g-prior, the marginal
likelihood conditional on g (or u) is

$$p(Y \mid \mathcal{M}, g) = p(Y \mid \mathcal{M}_\varnothing)\, \frac{(1+g)^{\frac{n - p_\mathcal{M} - 1}{2}}}{[1 + g(1 - R^2_\mathcal{M})]^{\frac{n-1}{2}}} \iff p(Y \mid \mathcal{M}, u) = p(Y \mid \mathcal{M}_\varnothing)\, \frac{u^{\frac{p_\mathcal{M}}{2}}}{[(1 - R^2_\mathcal{M}) + R^2_\mathcal{M} u]^{\frac{n-1}{2}}}. \tag{37}$$

Under the general tCCH prior (26), the marginal likelihood
$p(Y \mid \mathcal{M}) = \int p(Y \mid u, \mathcal{M})\, p(u)\, du$ lacks a known closed form expression; however, it is
analytically tractable under all the special cases discussed in Section 3.2, leading to the following.


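Proposition 6. For normal linear regression with unknown variance $\sigma^2$, where W is an
$n \times n$ diagonal weight matrix, let $R^2_\mathcal{M}$ be the coefficient of determination under the weighted
regression. Under the prior distribution $p(\alpha, \sigma^2) \propto 1/\sigma^2$, the mean-zero normal g-prior on
$\beta_\mathcal{M}$, and the tCCH prior on $1/(1+g)$,

(1) If r = 0 (or equivalently, $\kappa = 1$), then

$$p(Y \mid \mathcal{M}) = p(Y \mid \mathcal{M}_\varnothing)\, \frac{v^{-\frac{p_\mathcal{M}}{2}}\, B\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, \frac{n-1}{2}, \frac{a + b + p_\mathcal{M}}{2}, \frac{s}{2v}, \frac{R^2_\mathcal{M}}{v - (v-1) R^2_\mathcal{M}}\right)}{\left[1 - (1 - 1/v) R^2_\mathcal{M}\right]^{\frac{n-1}{2}} B\left(\frac{a}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{b}{2}, \frac{a+b}{2}, \frac{s}{2v}\right)}. \tag{38}$$

(2) If s = 0, then

$$p(Y \mid \mathcal{M}) = p(Y \mid \mathcal{M}_\varnothing)\, \frac{\kappa^{\frac{a + p_\mathcal{M} - 2r}{2}}}{v^{\frac{p_\mathcal{M}}{2}} (1 - R^2_\mathcal{M})^{\frac{n-1}{2}}}\, \frac{B\left(\frac{a + p_\mathcal{M}}{2}, \frac{b}{2}\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) {}_2F_1\left(r, \frac{b}{2}, \frac{a+b}{2}, 1 - \kappa\right)}\, F_1\left(\frac{a + p_\mathcal{M}}{2};\; \frac{a + b + p_\mathcal{M} + 1 - n - 2r}{2},\; \frac{n-1}{2};\; \frac{a + b + p_\mathcal{M}}{2};\; 1 - \kappa,\; 1 - \kappa - \frac{R^2_\mathcal{M}\, \kappa}{(1 - R^2_\mathcal{M})\, v}\right). \tag{39}$$

A proof of Proposition 6 is provided in supplementary material Appendix A.9 along with
a brief summary of relevant special functions in supplementary material Appendix A.8. Note
that (1) applies to the CH prior and all its special cases, and (2) applies to both robust and
intrinsic priors.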

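Similarly, under a wide range of parameters, the CHIC g-prior also yields tractable
marginal likelihoods for the double exponential family (West 1985; Efron 1986), which permits
over-dispersion in GLMs by introducing an unknown dispersion parameter $\phi$:

$$p(Y_i \mid \theta_i, \phi) = \phi^{1/2}\, p(Y_i \mid \theta_i)^{\phi}\, p(Y_i \mid \theta_i = t_i)^{1 - \phi}, \quad i = 1, \ldots, n, \tag{40}$$

where $p(Y_i \mid \theta_i)$ follows the GLM density, and the constant $t_i = \arg\max_{\theta_i} p(y_i \mid \theta_i)$.
An ideal feature of this over-dispersed GLM is that the MLEs $\hat\alpha_\mathcal{M}$, $\hat\beta_\mathcal{M}$ do not depend
on $\phi$. Furthermore, the observed information of $(\alpha_\mathcal{M}, \beta_\mathcal{M})$ remains block diagonal,
$\mathcal{J}_{n,\phi}(\hat\alpha_\mathcal{M}, \hat\beta_\mathcal{M}) = \mathrm{diag}\{\phi \mathcal{J}_n(\hat\alpha_\mathcal{M}),\, \phi \mathcal{J}_n(\hat\beta_\mathcal{M})\}$, where
$\mathcal{J}_n(\hat\alpha_\mathcal{M})$ and $\mathcal{J}_n(\hat\beta_\mathcal{M})$ are the observed information matrices for ordinary GLMs
(see (10)). Therefore, the CHIC g-prior can be modified easily to account for over-dispersion

$$\beta_\mathcal{M} \mid g, \phi, \mathcal{M} \sim N\left(0,\; \frac{g}{\phi}\, \mathcal{J}_n(\hat\beta_\mathcal{M})^{-1}\right), \quad p(\alpha) \propto 1, \quad p(\phi) \propto \frac{1}{\phi},$$

and provides closed form approximate marginal likelihoods after integrating out $\phi$

$$p(Y \mid \mathcal{M}, u) \propto \frac{u^{\frac{p_\mathcal{M}}{2}}}{\left[u Q_\mathcal{M} + 2 \sum_{i=1}^{n} \left\{Y_i (t_i - \hat\theta_i) - b(t_i) + b(\hat\theta_i)\right\}\right]^{\frac{n-1}{2}}}. \tag{41}$$

A derivation of (41) is provided in supplementary material Appendix A.10. Since (41) contains
the same type of kernel function of u as (37), there exists a similar result to Proposition 6:
for all special cases of the CHIC family in Section 3.2, marginal likelihoods are tractable
after integrating out u.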

The CHIC g-prior provides a rich and unifying framework that encompasses several common
mixtures of g-priors. However, this full six-parameter family poses an overwhelming
range of choices to elicit for an applied statistician. As many of the parameters are not
updated by the data, we appeal to the model selection criteria or desiderata proposed by Bayarri
et al. (2012) to help in recommending priors from this class.

4 Desiderata for Model Selection Priors


Bayarri et al. (2012) establish primary criteria that priors for model selection or model
averaging should ideally satisfy.


4.1 Basic Criterion 


The basic criterion requires the conditional prior distributions $p(\beta_\mathcal{M} \mid \mathcal{M}, \alpha)$ to be proper,
so that Bayes factors do not contain different arbitrary normalizing constants across different
subset models (Kass and Raftery 1995). This criterion does not require specification of a
proper prior on $\alpha$, nor orthogonalization of $\alpha$ (Bayarri et al. 2012). For the g-prior (13),
under any model $\mathcal{M}$, as long as the observed information $\mathcal{J}_n(\hat\beta_\mathcal{M})$ is positive-definite, the
prior distribution $p(\beta_\mathcal{M} \mid g, \mathcal{M})$ is a normal distribution, and hence the basic criterion holds.
It also holds under mixtures of g-priors for any proper prior distribution on g. The basic
criterion eliminates the Jeffreys prior on g, unless the null model is not within consideration.


4.2 Invariance 


Measurement invariance suggests that answers should not be affected by changes of measurement
units, i.e., location-scale transformation of predictors. Under the g-prior (13), the prior
covariance on $\beta_\mathcal{M}$ is proportional to
$\mathcal{J}_n(\hat\beta_\mathcal{M})^{-1} = [X^{cT}_\mathcal{M}\, \mathcal{J}_n(\hat\eta_\mathcal{M})\, X^c_\mathcal{M}]^{-1}$. If the design matrix is rescaled
to $X_\mathcal{M} D$, where D is a positive definite diagonal matrix, then the normalized design $X^c_\mathcal{M}$
becomes $X^c_\mathcal{M} D$, and coefficients are rescaled to $D^{-1} \beta_\mathcal{M}$. Since the MLE $\hat\eta_\mathcal{M}$ remains the
same, the prior distribution on $X^c_\mathcal{M} \beta_\mathcal{M}$ is invariant under rescaling. Furthermore, the prior on
$\beta_\mathcal{M}$ is also invariant under translation, since shifting columns of $X_\mathcal{M}$ does not change $\hat\eta_\mathcal{M}$
or $X^c_\mathcal{M}$. The uniform prior on $\alpha$ combined with the CHIC g-prior ensures that the prior
on $\eta_\mathcal{M}$ is invariant under linear transformations. For models with unknown variance, the
reference prior on $\sigma^2$ ensures invariance under scale transformations.

4.3 Model Selection Consistency


Model selection consistency (Fernandez et al. 2001) has been widely used as a crucial criterion
in prior specification. Based on Bayes rule under the 0-1 loss, a prior distribution is consistent
for model selection if as $n \to \infty$, the posterior probability of $\mathcal{M}_T$ converges in probability to
one, or equivalently, the Bayes factor tends to infinity

$$p(\mathcal{M}_T \mid Y) \xrightarrow{P} 1 \iff \mathrm{BF}_{\mathcal{M}_T:\mathcal{M}} \xrightarrow{P} \infty, \quad \text{for all } \mathcal{M} \ne \mathcal{M}_T,$$

under fixed p and bounded prior odds $p(\mathcal{M}_T)/p(\mathcal{M})$. For normal linear regressions,
Zellner-Siow, hyper-g/n, and the robust priors have been shown to be consistent (Liang et al. 2008;
Bayarri et al. 2012), while for GLMs, the Zellner-Siow and hyper-g/n priors based on the
null-based g-prior have been shown to be consistent (Wu et al. 2016). We establish
consistency for special cases of the CHIC g-prior in Table 1.


Theorem 1. When $\mathcal{M}_T \ne \mathcal{M}_\varnothing$, model selection consistency holds under the robust prior, the
intrinsic prior, the CH prior, and the local EB g-prior. When $\mathcal{M}_T = \mathcal{M}_\varnothing$, consistency still
holds under the robust prior, the intrinsic prior, and the CH prior with b = O(n) or s = O(n),
but not under the local EB.


The proof is available in supplementary materials Appendix A.11. Note that for the CH
priors, the result also holds if the parameters a, b, s are model specific (for example, the
parameters in the Beta-prime prior depend on $p_\mathcal{M}$). As revealed in Table 1, among the
mixtures of g-priors, model selection consistency holds under all but the three hyper-g prior
variants, where consistency fails under the null model. Priors that are globally consistent
imply prior choices of g = O(n), which will be discussed in Section 4.5. This corresponds
to flatter priors on $\beta_\mathcal{M}$, which impose enough penalty on model sizes, so that the selection
consistency holds even when $\mathcal{M}_T = \mathcal{M}_\varnothing$.

4.4 Information Consistency

In normal linear regression, with a fixed sample size $n > p_\mathcal{M} + 1$, the information consistency
fails under the g-prior with fixed g (Liang et al. 2008), in the sense that the Bayes
factor (37) is bounded when model $\mathcal{M}$ fits all observations perfectly, i.e., $R^2_\mathcal{M} = 1$
or $F \to \infty$, although in principle it should favor $\mathcal{M}$ overwhelmingly over $\mathcal{M}_\varnothing$. Bayarri
et al. (2012) reformulate the information consistency as follows: If there exists a sequence of
datasets with the same sample size n such that the likelihood ratio between $\mathcal{M}$ and $\mathcal{M}_\varnothing$ goes
to infinity, then their Bayes factor should also go to infinity.
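The boundedness under fixed g can be made concrete with a short sketch. It assumes the fixed-g Bayes factor for normal linear models in the form $(1+g)^{(n-1-p_\mathcal{M})/2} / [1 + g(1 - R^2_\mathcal{M})]^{(n-1)/2}$ (Liang et al. 2008); the function name and the numbers are illustrative only.

```python
import math


def log_bf_fixed_g(r2, n, p, g):
    """log Bayes factor of a model against the null for a fixed g (normal linear model)."""
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


n, p, g = 20, 3, 20.0
# Limit reached when the model fits perfectly (R^2 = 1): it is finite for any fixed g.
log_bf_bound = 0.5 * (n - 1 - p) * math.log1p(g)
log_bf_values = [log_bf_fixed_g(r2, n, p, g) for r2 in (0.0, 0.5, 0.9, 0.99, 1.0)]
```

The log Bayes factor increases with $R^2_\mathcal{M}$ but never exceeds the bound, which is why a prior on g with sufficiently heavy tails is needed for the Bayes factor to diverge along such a sequence of datasets.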

GLMs with categorical responses such as binary and Poisson regressions, have likelihood 
functions based on probability mass functions, which have a natural upper bound 1, so that 
even under data separation for binary data, the likelihood ratio remains bounded, and hence 
information consistency is not an issue for these GLMs for any prior that satisfies the basic 
criterion. 


4.5 Intrinsic Consistency 

The intrinsic consistency suggests that as n increases, the limit distribution of $p(\beta_\mathcal{M} \mid \alpha, \mathcal{M})$
should be independent of n and remain proper, instead of degenerating to a point mass
(Bayarri et al. 2012). By a lemma in the supplementary materials, $\mathcal{J}_n(\hat\beta_\mathcal{M}) = O_P(n)$ if
$\mathcal{M} \supseteq \mathcal{M}_T$, so with any fixed value of g, the g-prior (13) depends implicitly on n, and reduces
to a point mass at zero asymptotically. Hence in the g-prior or mixtures of g-priors, the choice
g = O(n) is essential to prevent the g-prior from dominating the likelihood.

The intrinsic consistency is shown to hold under the robust prior, since the prior density
of g/n does not depend on n in the limit (Bayarri et al. 2012). In this sense, other existing
priors such as the unit information prior (g set to be n), Zellner-Siow, hyper-g/n, and intrinsic
priors also satisfy the intrinsic consistency. On the other hand, for some mixtures of g-priors,
whose induced prior densities p(g/n) lack closed forms, an implicit version of the intrinsic
consistency that states E(1/g) = O(1/n) can be studied. This implicit intrinsic consistency
is shown to hold under the Beta-prime prior (Maruyama and George 2011). We show that it
also holds under the CH prior in the following proposition, with certain hyper parameters.


Proposition 7. Under the CH prior, if the parameters b = O(n) or s = O(n), then the prior
expectation E(1/g) = O(1/n) as n goes to infinity.
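A simple special case illustrates the rate. When s = 0 the CH prior is a Beta(a/2, b/2) distribution on u, and since 1/g = u/(1-u), the standard Beta moment gives E(1/g) = a/(b-2) whenever b > 2. The sketch below (function name and values are illustrative) evaluates this closed form with b = n.

```python
def expected_inv_g_beta(a, b):
    """E(1/g) = E[u / (1 - u)] for u ~ Beta(a/2, b/2); finite only when b > 2."""
    if b <= 2:
        raise ValueError("E(1/g) is infinite unless b > 2")
    return a / (b - 2.0)


a = 1.0
# Scale by n to expose the O(1/n) rate when b grows like n.
scaled = [n * expected_inv_g_beta(a, b=float(n)) for n in (100, 1000, 10000)]
```

The scaled values decrease toward a, so E(1/g) behaves like a/n, whereas holding b fixed leaves E(1/g) constant in n (or infinite when b is at most 2, as for the hyper-g prior).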


The proof is provided in supplementary materials Appendix A.12. In contrast, the g-prior
with fixed g, the hyper-g prior and its special cases are eliminated due to their g = O(1)
choices. Note that for the CHIC family, the intrinsic consistency and the previously discussed
model selection consistency hold under the same conditions.


4.6 Estimation Consistency 

Parameter estimation is an essential part of regression analysis, with or without model
selection. When $\mathcal{M}_T$ is known and $\mathcal{M}_T \ne \mathcal{M}_\varnothing$, one detractor of the g-prior with fixed g is that the
approximate posterior mean
$E[\beta_{\mathcal{M}_T} \mid Y, g, \mathcal{M}_T] = \frac{g}{1+g}\, \hat\beta_{\mathcal{M}_T} \xrightarrow{P} \frac{g}{1+g}\, \beta^*_{\mathcal{M}_T}$ remains
biased asymptotically as n tends to infinity. For mixtures of g-priors, since the distribution of
g adapts to the data, a sufficient condition to resolve this asymptotic bias is for the posterior
distribution of the shrinkage factor $z = g/(1+g)$ to converge to 1 in the limit.

Proposition 8. For the CH, robust, and intrinsic priors, when $\mathcal{M}_T \ne \mathcal{M}_\varnothing$, the characteristic
function of the conditional posterior distribution of $z = g/(1+g)$ under $\mathcal{M}_T$ converges in
probability to that of a degenerate distribution at 1, i.e., for any $t \in \mathbb{R}$,
$\phi_{z \mid Y, \mathcal{M}_T}(t) \xrightarrow{P} \exp(it)$. Therefore, all moments of $p(z \mid Y, \mathcal{M}_T)$ converge to 1 in probability.
In particular, the posterior mean $E(z \mid Y, \mathcal{M}_T) \xrightarrow{P} 1$ and the posterior variance
$V(z \mid Y, \mathcal{M}_T) \xrightarrow{P} 0$.

The proof is given in supplementary materials Appendix A.13.

When $\mathcal{M}_T$ is unknown, one may prefer Bayesian model averaging (BMA) estimators to
account for model uncertainty. In BMA, $\beta$ denotes the p dimensional vector of coefficients
corresponding to all potential predictors, while $\beta_\mathcal{M}$ is typically a length $p_\mathcal{M}$ vector of the nonzero
coefficients. With a slight abuse of notation, we let $\beta_\mathcal{M}$ denote the length p vector, with
zeros filled for the dimensions not included in $\mathcal{M}$. The posterior of $\beta$ under BMA is thus

$$p(\beta \mid Y) = p(\mathcal{M}_T \mid Y)\, p(\beta_{\mathcal{M}_T} \mid Y, \mathcal{M}_T) + \sum_{\mathcal{M} \ne \mathcal{M}_T} p(\mathcal{M} \mid Y)\, p(\beta_\mathcal{M} \mid Y, \mathcal{M}), \tag{42}$$

where conditional posterior distributions
$p(\beta_\mathcal{M} \mid Y, \mathcal{M}) = \int p(\beta_\mathcal{M} \mid Y, g, \mathcal{M})\, p(g \mid Y, \mathcal{M})\, dg$
for all subset models $\mathcal{M} \ne \mathcal{M}_\varnothing$. When the selection consistency holds, i.e., $p(\mathcal{M}_T \mid Y) \xrightarrow{P} 1$,
the second term in (42) vanishes in the limit, so we just need to study the posterior distribution
of $\beta_{\mathcal{M}_T}$. When $\mathcal{M}_T = \mathcal{M}_\varnothing$, if the selection consistency fails, consistency of the MLEs yields
the correct estimation of the true parameter $\beta^* = 0$, with or without shrinkage.

Theorem 2. For the CH, robust, and intrinsic priors, the characteristic function of the
posterior distribution under BMA $p(\beta \mid Y)$ converges in probability to that of a degenerate
distribution at $\beta^*_{\mathcal{M}_T}$, i.e., for any $t \in \mathbb{R}^p$,
$\phi_{\beta \mid Y}(t) \xrightarrow{P} \exp(i\, t^T \beta^*_{\mathcal{M}_T})$. In particular, the mean and
covariance of the posterior distribution of $\beta$ under model averaging have limits
$E(\beta \mid Y) \xrightarrow{P} \beta^*_{\mathcal{M}_T}$ and $V(\beta \mid Y) \xrightarrow{P} 0$.

A proof is given in supplementary materials Appendix A.14. Note that this estimation
consistency for $\beta$ also implies estimation consistency for $\eta$ and functions of $\eta$.


4.7 Predictive Matching 

Predictive matching is viewed as one of the most crucial aspects for objective model selection
priors as improper scaling of priors may have critical consequences for comparing models in
high dimensional problems (Bayarri et al. 2012). Jeffreys suggests that when comparing two
models with minimal sample sizes where one should not be able to discriminate between them,
the Bayes factor should be close to one. In particular, exact predictive matching occurs if it
equals one. The minimal training sample is defined by Bayarri et al. (2012) as the smallest
sample size with a finite nonzero marginal density for the combination of models and priors.
For normal linear models with unknown variance, the minimal sample size is 2 (or the number
of parameters in the null model) and exact predictive matching occurs under the CHIC
g-priors. For GLMs with known dispersion, the minimal training sample size would be 1. The
asymptotic approximations of course do not apply in such a case; however, for a minimal
sample size and a model for which $\beta_\mathcal{M} \ne 0$ but is not identifiable, the results from
Proposition 3 establish that exact null predictive matching holds under the CHIC g-prior.


5 Examples 


We explore properties of the priors in finite samples for logistic regression via simulation
studies under a range of sparsity scenarios. Results from Poisson regression reveal similar findings
to the logistic simulation study, and are included in the supplementary material. We
then turn to a re-analysis of the GUSTO-I data considered in Held et al. (2015) to illustrate
the methodology and compare prior distributions for estimation of posterior inclusion
probabilities and out-of-sample predictive performance. The R package BAS, available on CRAN,
is used for all computations in this section.


5.1 A Simulation Study 

We conduct a simulation to explore properties of the priors for model selection and estimation
in logistic regression using p = 20 and p = 100 predictors and under different designs for X.
For each simulated dataset, we take n = 500 with the columns of X drawn from standard
normal distributions, whose pairwise correlations $\mathrm{cor}(X_i, X_j)$ for $1 \le i < j \le p$ are governed
by a parameter r, with r = 0 (independent design) or r = 0.75 (correlated design). We consider
four different levels of sparsity in the true model (see Table 2) for p = 20. For p = 100, we consider
only the sparse scenario where $p_{\mathcal{M}_T} = 5$, with additional coefficients $\beta_{\mathcal{M}_T, 21:100} = 0$. For
p = 20, we enumerate all $2^{20}$ subset models using a uniform distribution over the
model space, $p(\mathcal{M}) = 1/2^p$, which assigns every model equal prior weight. For p = 100, we
use the MCMC algorithm in Clyde et al. (2011) with $2^{17} \approx 131{,}000$ iterations. In addition
to the uniform prior, we also consider the Beta-Binomial(1, 1) prior over the model space,
$p(\mathcal{M}) = (p+1)^{-1} \binom{p}{p_\mathcal{M}}^{-1}$, which is recommended for multiplicity adjustment in Bayesian
variable selection for large p as it puts uniform weights on model sizes 0, 1, ..., p (Ley and
Steel 2009) and encourages sparsity when $p_{\mathcal{M}_T} < p/2$.
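The two model-space priors are easy to compare directly. The sketch below (function names are illustrative) computes the log prior probability of a single model of a given size under the uniform prior and under the Beta-Binomial(1, 1) prior, and recovers the implied prior on model size.

```python
import math


def log_prior_uniform(p):
    """log p(M) when all 2^p models are equally likely."""
    return -p * math.log(2.0)


def log_prior_beta_binomial(p, p_model):
    """log p(M) = -log(p + 1) - log C(p, p_model) under Beta-Binomial(1, 1)."""
    return -math.log(p + 1.0) - math.log(math.comb(p, p_model))


p = 20
# Prior mass on each model size: (number of models of that size) x (prior per model).
size_mass = [math.comb(p, k) * math.exp(log_prior_beta_binomial(p, k))
             for k in range(p + 1)]
```

Each model size receives mass 1/(p+1), so an individual model near size p/2 gets far less weight than under the uniform prior, while very small and very large models get more.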


Table 2: Values of intercept $\alpha_{\mathcal{M}_T}$ and coefficients $\beta_{\mathcal{M}_T}$ in the true models in the logistic
regression simulation study with p = 20, where $b = (2, -1, -1, 0.5, -0.5)^T$.

Scenario   p_{M_T}   α_{M_T}   β_{1:5}   β_{6:10}   β_{11:15}   β_{16:20}
Null        0                  0         0          0           0
Sparse      5        -0.5      b         0          0           0
Medium     10                  b         0          b           0
Full       20                  b         b          b           b


For model selection, we select the model with the highest posterior probability (or the
smallest AIC, BIC) under a 0-1 loss. Table 3 displays the number of times $\mathcal{M}_T$ is selected in
100 simulations under each scenario, while Table C.1 in the supplementary materials shows
the average size of the selected models. The fully Bayes methods can be roughly divided
into two groups according to their prior concentration preference: g = O(n) and g = O(1).
The g = O(n) group, including all the special cases of the CHIC prior that satisfy model
selection and intrinsic consistency (see Table 1), leads to more parsimonious models, and
hence outperforms the rest of the methods in scenarios where the full model is not true, while
the g = O(1) group, including the hyper-g prior and its special cases, is more accurate only
when the full model is true. These results also confirm the theoretical findings in Section 4.3
and in Liang et al. (2008), that the priors on g independent of n are not consistent for model
selection when $\mathcal{M}_T = \mathcal{M}_\varnothing$. Interestingly, the hyper-g/n prior, although in the g = O(n)
group, performs closer to the hyper-g prior variants when the full model is true, or when
p = 100. For the unit information prior, i.e., the g-prior with g = n, the DBF and TBF yield
almost identical results, which is also noted by Held et al. (2015), and provide results that are
intermediate. Both can outperform mixtures of g-priors in the g = O(n) group when the true
model is sparse, but may not perform as well as them when $\mathcal{M}_T$ is the null model or the full
model.

(Since the Jeffreys prior is improper, the null model is always excluded when implementing it.)

Among non-fully Bayesian methods, the local EB tends to favor large models, which is
also noted in Hansen and Yu (2003). When $\mathcal{M}_T = \mathcal{M}_\varnothing$, it never selects the correct model
but, surprisingly, almost always selects the full model (average model size is 19). Between AIC
and BIC, the former favors larger models while the latter favors smaller ones. BIC performs
comparably to priors in the $g = O(n)$ group as long as $\mathcal{M}_T$ is not the full model.

The prior distribution over the model space also leads to significant differences. When
$p = 100$ and $p_{\mathcal{M}_T} = 5$, under most $g$-priors and mixtures of $g$-priors, the Beta-Binomial(1, 1)
prior favors sparser models than the uniform prior, leading to more accurate model selection
results. However, the opposite holds for the hyper-$g/n$ prior, the three hyper-$g$ variants,
and the local EB, for which the average model sizes are large (around 70) under the uniform
prior, but even larger under the Beta-Binomial prior (close to 100). This phenomenon can
be explained by the symmetric U-shaped density curve of the Beta-Binomial prior (Scott
and Berger 2010, Fig. 1), where the null model and the full model have the highest prior
probabilities among all individual models. For methods that lead to marginal likelihoods that
favor model sizes larger than $p/2$, the Beta-Binomial(1, 1) prior does not necessarily promote
sparsity and may encourage selection of the full model.

Table 3: Logistic regression simulation example: number of times the true model is selected
out of 100 realizations. Column-wise maximum is in bold type.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_MT                         0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75
CH(a = 1/2, b = n)          92      88      61      29      38       8       6       0      11      11      61       6
CH(a = 1, b = n)            85      82      60      30      37       8       6       0      15       9      61       6
CH(a = 1/2, b = n/2)        86      84      46      28      30      12       8       0       3       2      62       6
CH(a = 1, b = n/2)          70      73      45      30      30      11       8       0       8       4      63       6
Beta-prime                  92      88      61      29      38       8       7       0      11       6      61       6
ZS adapted                  85      82      60      30      37       8       6       0       8      11      61       6
Benchmark                   91      93      28      31      19       8      16       0       6       3      62       6
Robust                      86      83      41      29      29      10       8       0       4       1      52       5
Intrinsic                   76      77      40      29      26      10       8       0       2       3      56       5
Hyper-g/n                   77      73      37      31      23       7      16       0       0       0       1       0
DBF, g = n                  73      79      67      29      31       2       0       0      68      26      55       3
TBF, g = n                  73      79      67      29      31       2       0       0      68      27      55       3
Jeffreys                    NA      NA      28      28      17       7      16       0       0       0       1       0
Hyper-g                      6       9      25      29      15       8      16       1       0       0       0       1
Uniform                      2       5      23      24      14       6      18       1       0       0       0       0
Local EB                     0       0      25      29      15       7      16       1       0       0       0       0
AIC                          3       7       5       9      13       5      12       0       1       2      63      15
BIC                         73      79      67      29      31       2       0       0      67      28      55       3

Estimation and prediction are often more important than identifying the true model,
particularly for large $p$. To evaluate the performance for parameter estimation, we report
SSE($\beta$) $= \sum_{j=0}^{p} (\tilde{\beta}_j - \beta^*_{\mathcal{M}_T, j})^2$ in Table 4, where $\tilde{\beta}_j$ represents the posterior mean estimate
under BMA (here $\beta_0$ corresponds to the intercept $\alpha$), while for AIC and BIC it is the
MLE under the selected model. An overall trend is that the methods that perform better in model
selection generally yield smaller estimation errors. One exception is the $g = O(1)$ priors
and the local EB, which have small SSE under the null despite their poor model selection
performance.

We also examined the out-of-sample classification error for logistic regression, which
revealed almost no difference across methods.


5.2 GUSTO-I Study 

We use a publicly available subset of the GUSTO-I data (Steyerberg 2009; Held et al. 2015),
containing $n = 2188$ patients, to illustrate the methodology for predicting a binary endpoint
of 30-day survival for myocardial infarction. We use the same $p = 17$ predictors as in Held
et al. (2015), labeled in the same order.


^This dataset is available on the book website http://www.clinicalpredictionmodels.org

Figure 1: Marginal posterior inclusion probabilities for the GUSTO-I data. The colors are
related to the magnitude of the inclusion probability with darkest blue corresponding to one
and red to zero, while 0.5 is shown as white. Rows (top to bottom): CH(a = 1/2, b = n),
CH(a = 1, b = n), CH(a = 1/2, b = n/2), CH(a = 1, b = n/2), Beta-prime, ZS adapted, Benchmark,
Robust, Intrinsic, Hyper-g/n, DBF (g = n), TBF (g = n), Jeffreys, Hyper-g, Uniform, Local EB,
AIC, BIC. Columns: predictors X1 to X17.

Table 4: Logistic regression simulation example: 100 times the average SSE($\beta$)
of 100 realizations. Column-wise minimum is in bold type.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_MT                         0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75
CH(a = 1/2, b = n)           3       3      21      44      51      96      94     184     109     135      26      78
CH(a = 1, b = n)             3       4      21      43      51      96      94     183     119     139      26      77
CH(a = 1/2, b = n/2)         4       5      22      43      50      92      87     172     158     182      26      75
CH(a = 1, b = n/2)           4       5      22      43      50      92      86     172     160     189      27      74
Beta-prime                   3       3      21      44      51      96      94     183     123     142      26      78
ZS adapted                   3       4      21      43      51      96      94     183     121     144      26      77
Benchmark                    4       7      21      44      49      89      73     158     169     195      26      75
Robust                       4       5      23      44      52      91      90     165     252     292     193     139
Intrinsic                    4       6      23      44      52      91      90     165     239     284     143      90
Hyper-g/n                    3       4      21      43      48      88      72     158     197     226     441     326
DBF, g = n                   3       3      20      47      54     117     113     244      42      65      27      82
TBF, g = n                   3       3      20      47      54     117     113     245      42      65      27      83
Jeffreys                     2       3      22      45      50      89      74     159     212     231     444     387
Hyper-g                      2       3      22      45      51      90      76     160     219     233     451     396
Uniform                      2       2      22      46      52      91      78     161     230     236     459     411
Local EB                     1       1      22      45      50      89      74     158     245     236     608     434
AIC                          8      15      29      51      59      93     103     158     287     353      39      71
BIC                          3       3      21      47      55     117     113     245      42      65      27      82


Figure 1 illustrates heatmaps of the marginal posterior inclusion probabilities (pip) for
each of the 17 predictors under enumeration of all $2^{17}$ possible models in the model space
using a range of priors on $g$ and the uniform and Beta-Binomial(1, 1) prior distributions on
the model space. For AIC and BIC we use $\exp(-\mathrm{AIC}/2)$ and $\exp(-\mathrm{BIC}/2)$, respectively, in
the place of the approximate marginal likelihood to calculate posterior model probabilities.
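
To make the AIC/BIC construction concrete, the sketch below (Python, standard library only) converts information-criterion values into pseudo posterior model probabilities by using $\exp(-\mathrm{IC}/2)$ in place of the marginal likelihood, and then sums them into inclusion probabilities. The function names and the toy BIC values are hypothetical and are not taken from the GUSTO-I analysis.

```python
from math import exp


def posterior_model_probs(ic_values, prior_probs=None):
    """Pseudo posterior model probabilities, with exp(-IC / 2) standing in
    for the marginal likelihood of each model."""
    n_models = len(ic_values)
    if prior_probs is None:
        # Uniform prior over the enumerated models.
        prior_probs = [1.0 / n_models] * n_models
    best = min(ic_values)  # subtract the minimum for numerical stability
    weights = [exp(-(ic - best) / 2.0) * pr for ic, pr in zip(ic_values, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def inclusion_probs(models, probs, p):
    """Marginal inclusion probability of predictor j: the sum of the
    probabilities of all models that contain j."""
    pip = [0.0] * p
    for included, prob in zip(models, probs):
        for j in included:
            pip[j] += prob
    return pip


# Toy enumeration with p = 2 predictors (hypothetical BIC values).
models = [(), (0,), (1,), (0, 1)]
bic = [110.0, 100.0, 112.0, 103.0]
probs = posterior_model_probs(bic)
pip = inclusion_probs(models, probs, p=2)
print(probs)
print(pip)
```

A non-uniform model-space prior, such as the Beta-Binomial(1, 1), can be supplied through the `prior_probs` argument.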

Figure 1 shows that the predictors $X_2, X_3, X_5, X_6, X_{16}$ have high inclusion probabilities
under all methods, reinforcing the findings in Held et al. (2015). Comparison across different
methods reveals the same trend as supported by theory and in the simulation studies: the
$g = O(n)$ group and BIC lead to sparser models than the $g = O(1)$ group, local EB, and AIC.
Within the $g = O(n)$ group, the unit information prior, under either DBF or TBF, yields
the most parsimonious model, while the benchmark and hyper-$g/n$ priors tend to select more
predictors, leading to results that are more similar to the $g = O(1)$ group. As in the
simulation study, the Beta-Binomial(1, 1) prior does not automatically favor sparser models:
inclusion probabilities are higher than under the uniform prior for a number of variables,
even in the $g = O(n)$ group.


To explore out-of-sample predictive performance, we use bootstrap cross-validation (Fu
et al. 2005) to evaluate predictions under BMA. Each of the 1000 bootstrap datasets
is obtained via sampling with replacement, with the same sample size $n = 2188$. We fit the
models on the bootstrap samples, and then study prediction using the left-out samples, whose
sample size is about one-third of $n$. As in Held et al. (2015), we summarize performance using
the area under the ROC curve (AUC), calibration slope (CS), and logarithmic score (LS), and
also include the Brier score, i.e., the average squared difference between $\hat{\mu}$ and $Y$. Among
these measurements, AUC and CS closer to one indicate better discrimination and calibration,
respectively, while smaller LS suggests better discrimination and calibration, and smaller Brier
score indicates more accurate predictions. Table 5 shows that overall the methods perform
similarly, with methods that prefer denser models in selection, such as the benchmark,
hyper-$g/n$, hyper-$g$, local EB, and AIC, slightly outperforming the others. In particular, the uniform
prior on $u$ (a special case of the hyper-$g$ prior) yields the most accurate prediction under all
four summaries. Over the model space, the uniform prior slightly outperforms the
Beta-Binomial(1, 1) in terms of AUC, CS, and LS.
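
The bootstrap cross-validation split and two of the summaries can be sketched as follows (Python, standard library only). The function names are illustrative. The logarithmic score is written here as the mean negative log predictive probability, which is an assumption about the exact convention; the Brier score follows the definition in the text. The expected left-out fraction is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$, consistent with "about one-third of $n$".

```python
import random
from math import exp, log


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the indices never drawn form the
    left-out (out-of-bag) validation set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def brier_score(y, mu):
    """Average squared difference between predicted probability and outcome."""
    return sum((m - yi) ** 2 for yi, m in zip(y, mu)) / len(y)


def log_score(y, mu):
    """Mean negative log predictive probability (assumed convention; smaller is better)."""
    return -sum(yi * log(m) + (1 - yi) * log(1 - m) for yi, m in zip(y, mu)) / len(y)


rng = random.Random(0)        # fixed seed so the split is reproducible
n = 2188                      # GUSTO-I sample size from the text
in_bag, out_of_bag = bootstrap_split(n, rng)
oob_fraction = len(out_of_bag) / n
expected_oob = (1 - 1 / n) ** n

print(oob_fraction, expected_oob)
print(brier_score([1, 0], [0.9, 0.2]))
print(log_score([1, 0], [0.5, 0.5]))
```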

One potential explanation for the better performance of the $g = O(1)$ priors and the local EB
is that shrinkage is better calibrated to the data, avoiding over-fitting (Copas 1983). As
the shrinkage factor $g/(1 + g)$ increases with $g$, the $g = O(1)$ priors and the local EB tend to
impose stronger shrinkage than the $g = O(n)$ priors. For the GUSTO-I dataset, the BMA
posterior estimate of $g$ is 14.7 for the uniform prior on $u$, 16.5 for hyper-$g$, 18.4 for local
EB, 24.0 for benchmark, 25.4 for hyper-$g/n$, 50.0 for ZS adapted, 286.5 for intrinsic, 298.1
for CH($a = 1$, $b = n$, $s = 0$), 319.6 for Beta-prime, and 321.2 for the robust prior. Comparing
these estimates with the data likelihood of $g$ marginalized over the model space,
$p(Y \mid g) = \sum_{\mathcal{M}} p(Y \mid \mathcal{M}, g)\, p(\mathcal{M} \mid g)$, we find that estimates of $g$ from the $g = O(1)$ priors, local EB,
benchmark, and hyper-$g/n$ priors are closer to the peak, near 20, of the marginal likelihood
(see Figure 2). On the other hand, as noted by Ley and Steel (2012), the robust and intrinsic
priors, which truncate the range of $g$ to lie above $(n - p_{\mathcal{M}})/(p_{\mathcal{M}} + 1) \ge 120.6$ and
$n/(p_{\mathcal{M}} + 1) \ge 121.6$, respectively, may not be well supported by the data when $n$ is large
and $p$ is small, as in the GUSTO-I data.

^For all special cases of the CHIC $g$-prior, the posterior estimates of $g$ are converted from the approximate
conditional posterior means of $u = 1/(1 + g)$, which have closed form expressions. These estimates of $g$ are
computed under the uniform prior on models $p(\mathcal{M}) = 1/2^p$.

Table 5: Prediction accuracy for the GUSTO-I data, aggregated from 1000 bootstrap
cross-validation sets. Bold font marks the largest AUC, the CS closest to one, and the smallest LS
and Brier score.

                            AUC      AUC       CS       CS       LS       LS    Brier    Brier
p(M)                       Unif  BB(1,1)     Unif  BB(1,1)     Unif  BB(1,1)     Unif  BB(1,1)
CH(a = 1/2, b = n)       0.8346   0.8338   0.9055   0.9065   0.1848   0.1851   0.0497   0.0497
CH(a = 1, b = n)         0.8347   0.8339   0.9054   0.9063   0.1848   0.1851   0.0497   0.0497
CH(a = 1/2, b = n/2)     0.8349   0.8343   0.9054   0.9049   0.1846   0.1849   0.0496   0.0497
CH(a = 1, b = n/2)       0.8349   0.8343   0.9054   0.9048   0.1846   0.1849   0.0496   0.0497
Beta-prime               0.8346   0.8338   0.9055   0.9065   0.1848   0.1851   0.0497   0.0497
ZS adapted               0.8345   0.8329   0.9338   0.9382   0.1846   0.1854   0.0496   0.0498
Benchmark                0.8352   0.8347   0.9292   0.9251   0.1841   0.1842   0.0495   0.0495
Robust                   0.8349   0.8344   0.9012   0.8998   0.1847   0.1849   0.0496   0.0497
Intrinsic                0.8350   0.8344   0.9010   0.8993   0.1846   0.1849   0.0496   0.0497
Hyper-g/n                0.8352   0.8346   0.9287   0.9265   0.1841   0.1842   0.0495   0.0495
DBF, g = n               0.8338   0.8325   0.9100   0.9126   0.1852   0.1857   0.0498   0.0499
TBF, g = n               0.8338   0.8325   0.9101   0.9126   0.1852   0.1857   0.0498   0.0499
Jeffreys                 0.8352   0.8346   0.9392   0.9373   0.1840   0.1841   0.0495   0.0495
Hyper-g                  0.8352   0.8346   0.9446   0.9429   0.1839   0.1840   0.0495   0.0495
Uniform                  0.8352   0.8346   0.9502   0.9485   0.1839   0.1840   0.0495   0.0495
Local EB                 0.8352   0.8346   0.9391   0.9373   0.1840   0.1841   0.0495   0.0495
AIC                      0.8351   0.8344   0.8813   0.8645   0.1846   0.1850   0.0495   0.0496
BIC                      0.8338   0.8325   0.9096   0.9122   0.1852   0.1857   0.0498   0.0499

Figure 2: Marginal likelihood of $g$ for the GUSTO-I data ($n = 2188$ and $p = 17$).

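
The shrinkage comparison in the GUSTO-I discussion can be reproduced with a few lines of arithmetic. The sketch below (Python, standard library only) uses the posterior estimates of $g$ reported in the text and evaluates the shrinkage factor $g/(1+g)$, together with the smallest truncation points of the robust and intrinsic priors at $p_{\mathcal{M}} = p = 17$. The dictionary keys and function name are illustrative labels only.

```python
def shrinkage_factor(g):
    """Shrinkage factor g / (1 + g); values closer to one mean less shrinkage."""
    return g / (1.0 + g)


# BMA posterior estimates of g for the GUSTO-I data, as reported in the text.
g_estimates = {
    "uniform on u": 14.7,
    "hyper-g": 16.5,
    "local EB": 18.4,
    "benchmark": 24.0,
    "hyper-g/n": 25.4,
    "ZS adapted": 50.0,
    "intrinsic": 286.5,
    "CH(a=1, b=n, s=0)": 298.1,
    "Beta-prime": 319.6,
    "robust": 321.2,
}
factors = {name: shrinkage_factor(g) for name, g in g_estimates.items()}

n, p = 2188, 17
# Lower truncation points of g at the full model size; smaller models give larger bounds.
robust_lower = (n - p) / (p + 1)
intrinsic_lower = n / (p + 1)

for name, f in factors.items():
    print(f"{name:20s} g = {g_estimates[name]:6.1f}  g/(1+g) = {f:.4f}")
print(round(robust_lower, 1), round(intrinsic_lower, 1))
```

Both truncation points lie well above the region around 20 where the marginal likelihood of $g$ peaks, which is the point made in the text about the robust and intrinsic priors.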

6 Conclusion 


In this article we introduced CHIC $g$-priors, a flexible family of mixtures of $g$-priors derived
from the Compound Confluent Hypergeometric distribution that encompasses the majority of
mixtures of $g$-priors used in practice as special cases. Under a wide range of hyper parameter
choices, CHIC $g$-priors satisfy various desiderata proposed by Bayarri et al. (2012). For model
selection where sparse models are often expected, based on both theoretical and empirical
studies, we recommend priors with the choice $g = O(n)$, such as the CH prior with $b = O(n)$ or
$s = O(n)$, Beta-prime, ZS adapted, benchmark, robust, intrinsic, and unit information priors.
For prediction, all methods yield similar accuracy and are asymptotically consistent, with
the local EB, hyper-$g$, benchmark, and hyper-$g/n$ priors, which favor larger models, slightly
outperforming the rest of the $g = O(n)$ group, even though the model selection consistency
criterion does not hold under the local EB and hyper-$g$ priors when $\mathcal{M}_T = \mathcal{M}_\varnothing$. Because
model selection and prediction are two unaligned goals with different objective functions
(Copas 1983), it is not surprising that no single prior overwhelmingly outperforms the others for
both goals. Similar to Ley and Steel (2012), we would also recommend the benchmark and
hyper-$g/n$ priors for general practitioners, due to their balanced performance in selection and
prediction.

A primary advantage of the CHIC $g$-priors is that marginal likelihoods are available in
tractable forms under the integrated Laplace approximation, requiring only simple summaries
from GLMs. Hence the CHIC $g$-prior has the same computational complexity as model fitting
for GLMs, leading to efficient algorithms for variable selection and model averaging under
enumeration. As $p$ increases (e.g., larger than 35) and enumerating the entire model space
becomes impractical, stochastic search algorithms (see Clyde et al. (2011); García-Donato
and Martínez-Beneito (2013) and the references therein) can be employed, while avoiding
computationally expensive model search alternatives such as the reversible jump MCMC
(Green 1995), as Bayes factors can be directly computed without sampling the model specific
parameters. All of the methods used in the examples and simulation studies within this article
are implemented in the R package BAS (Clyde 2016) available on CRAN.


Several extensions of the current mixture of $g$-priors in GLMs are possible. In this paper,
the number of predictors $p$ is assumed fixed. While we have established that $g$-priors are well
defined in the case of non-full rank designs (including the case $p_{\mathcal{M}} > n$), the second-order
Laplace approximation of the marginal likelihood is not precise enough. Under canonical
links, a correction factor derived based on a sixth-order Laplace approximation (Raudenbush
et al. 2000) can be readily applied to mixtures of $g$-priors and local EB estimates, as the
correction factor does not depend on $g$ (Sabanés Bové and Held 2011). This may lead to
improved approximations to marginal likelihoods at little increase in computational cost.

One of the motivations for using the observed information in defining the CHIC $g$-prior
is that it leads to analytically tractable expressions for studying the asymptotic properties
and for comparing with other methods. While the CHIC $g$-priors satisfy the desiderata, an
exception is complete separation in binary regression. This is not an issue in the $g$-priors
based on the information matrix under the null (Sabanés Bové and Held 2011; Held et al.
2015), which combined with Metropolis-Hastings algorithms would provide valid inference in
this case. As a practical solution, the addition of pseudo-observations such as in Bedrick
et al. (1996) based on a ridge-like prior (Gupta and Ibrahim 2009; Baragatti and Pommeret
2012) may be incorporated as part of the design, with the $g$-prior based on the augmented
design. This may provide an efficient computational algorithm to explore potential models as
exploratory data analysis, although theoretical properties of the combined prior would need
to be established. Independent priors on regression coefficients such as Ishwaran and Rao
(2005); Ghosh and Clyde (2011); Johnson and Rossell (2012); Ročková and George (2015)
often have better performance for estimation when predictors are highly correlated. These
ridge and generalized ridge priors, however, are not invariant within a model under all linear
transformations. Bayarri et al. (2012) find that the invariance property is necessary for
predictive matching, suggesting that this criterion may not hold under generalized ridge priors,
although they are invariant to location-scale changes under standardization of all predictors.

Finally, while the information paradox does not arise in GLMs with categorical data, this 
is an open question for other continuous GLMs without a dispersion parameter and may help 
in further elucidating restrictions on hyper parameters in the CHIC family. 


Supplementary Materials 

Appendix A: a list of assumptions, all the proofs, and some additional theoretical results.

Appendix B: discussion and an empirical example on the test-based Bayes factor.

Appendix C: a Poisson regression simulation example, and additional results on the logistic
regression simulation example.

Mixtures of $g$-priors in Generalized Linear Models
Supplementary Materials: Appendices


A Assumptions, Theoretical Results, and Proofs 


A.1 Assumptions and Regularity Conditions

The following assumptions and standard regularity conditions are used throughout the paper.

For functions $b(\cdot)$ and $\theta(\cdot)$ in the GLM density, their third derivatives exist and are
continuous on $\mathbb{R}$. The composite function $b' \circ \theta(\cdot)$, which links $E(Y)$ and $\eta$, is strictly
monotonic. The variance function $b'' \circ \theta(\cdot) \ge 0$, and the equality can only occur on the
boundary $\pm\infty$.

Finite MLEs $\hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}$ exist and are unique, under all subset models $\mathcal{M}$.

The design matrix $\mathbf{X}$ under the full model is known and has full column rank $p$. Here, $p$
is fixed. The column space $C(\mathbf{X})$ does not contain $\mathbf{1}_n$. When studying asymptotics, we
assume that for $i = 1, \ldots, n$ the norm of the $i$th row $\|\mathbf{x}_i\|_2$ is bounded by a constant,
and for all $n$, the smallest eigenvalue of $\mathbf{X}^T\mathbf{X}/n$ is bounded from below by a positive
constant. These conditions assure weak consistency (convergence in probability) and
asymptotic normality for MLEs (Fahrmeir and Kaufmann 1985).

The true model $\mathcal{M}_T$ is among the $2^p$ subset models under consideration. In
$\mathcal{M}_T$, true values of the intercept and regression coefficients are denoted by
$\alpha^*_{\mathcal{M}_T}$ and $\beta^*_{\mathcal{M}_T}$.

A.2 Proof of Proposition

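Proof. We first approximate the likelihood by a second order Taylor expansion at the MLE,

$$
\begin{aligned}
p(Y \mid \alpha, \beta_{\mathcal{M}}, \mathcal{M})
&\approx p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})
\exp\left\{ -\frac{1}{2}
\begin{pmatrix} \alpha - \hat{\alpha}_{\mathcal{M}} \\ \beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}} \end{pmatrix}^T
\begin{pmatrix}
\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n & \mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}_{\mathcal{M}} \\
\mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n & \mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}_{\mathcal{M}}
\end{pmatrix}
\begin{pmatrix} \alpha - \hat{\alpha}_{\mathcal{M}} \\ \beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}} \end{pmatrix}
\right\} \\
&= p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})
\exp\left\{ -\frac{1}{2} (\alpha - \hat{\alpha}_{\mathcal{M}} + m)^T \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right) (\alpha - \hat{\alpha}_{\mathcal{M}} + m)
- \frac{1}{2} (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}})^T \Phi\, (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}}) \right\},
\end{aligned}
$$

where $m = \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right)^{-1} \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}_{\mathcal{M}}\right) (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}})$ and

$$
\Phi = \mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}_{\mathcal{M}}
- \left(\mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right) \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right)^{-1} \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}_{\mathcal{M}}\right).
$$

In the above approximate likelihood, the matrix $\Phi$ acts like a precision matrix of $\beta_{\mathcal{M}}$. By
using the projection $\mathcal{P}_{\mathbf{1}_n} = \mathbf{1}_n \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right)^{-1} \mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}})$, we rewrite it as

$$
\Phi = \mathbf{X}_{\mathcal{M}}^T (\mathbf{I}_n - \mathcal{P}_{\mathbf{1}_n})^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) (\mathbf{I}_n - \mathcal{P}_{\mathbf{1}_n}) \mathbf{X}_{\mathcal{M}} = \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}).
$$

Under the flat prior $p(\alpha) \propto 1$, integrated Laplace approximation gives the marginal
likelihood density conditional on $\beta_{\mathcal{M}}$:

$$
\begin{aligned}
p(Y \mid \beta_{\mathcal{M}}, \mathcal{M}) &= \int p(Y \mid \alpha, \beta_{\mathcal{M}}, \mathcal{M})\, p(\alpha)\, d\alpha \\
&\propto p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})
\exp\left\{ -\frac{1}{2} (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}})^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}}) \right\} \\
&\quad \cdot \int \exp\left\{ -\frac{1}{2} (\alpha - \hat{\alpha}_{\mathcal{M}} + m)^T \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right) (\alpha - \hat{\alpha}_{\mathcal{M}} + m) \right\} d\alpha \\
&\propto p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M}) \left[\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right]^{-1/2}
\exp\left\{ -\frac{1}{2} (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}})^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) (\beta_{\mathcal{M}} - \hat{\beta}_{\mathcal{M}}) \right\}.
\end{aligned}
$$

□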
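
The key algebraic step in the proof, that completing the square in $\alpha$ leaves the precision $\mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n \mathbf{X}_{\mathcal{M}} - (\mathbf{X}_{\mathcal{M}}^T \mathcal{J}_n \mathbf{1}_n)(\mathbf{1}_n^T \mathcal{J}_n \mathbf{1}_n)^{-1}(\mathbf{1}_n^T \mathcal{J}_n \mathbf{X}_{\mathcal{M}})$ and that this equals the information of the weighted-centered design, can be checked numerically. The sketch below uses plain Python lists with a small made-up design and made-up positive weights $d_i$; it is an illustration of the identity, not the BAS implementation.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical toy inputs: n = 4 observations, 2 predictors.
X = [[0.5, -1.0], [1.5, 0.3], [-0.7, 2.0], [0.2, 0.9]]
d = [0.25, 0.20, 0.10, 0.15]      # diagonal of J_n(eta_hat), assumed positive
n, p = len(X), len(X[0])
J = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
ones = [[1.0] for _ in range(n)]

a = sum(d)                         # 1' J 1 (a scalar)
XtJ = matmul(transpose(X), J)      # p x n
XtJX = matmul(XtJ, X)              # p x p
XtJ1 = matmul(XtJ, ones)           # p x 1

# Schur complement obtained by completing the square in alpha.
Phi = [[XtJX[j][k] - XtJ1[j][0] * XtJ1[k][0] / a for k in range(p)] for j in range(p)]

# Centered design (I - P) X with P = 1 (1'J1)^{-1} 1'J: subtract d-weighted column means.
wmean = [sum(d[i] * X[i][j] for i in range(n)) / a for j in range(p)]
Xc = [[X[i][j] - wmean[j] for j in range(p)] for i in range(n)]
J_beta = matmul(matmul(transpose(Xc), J), Xc)

max_diff = max(abs(Phi[j][k] - J_beta[j][k]) for j in range(p) for k in range(p))
print(max_diff)
```

The centering uses means weighted by $d_i$ rather than ordinary column means, which is why the two expressions agree for any positive weights.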


A.3 Asymptotic Performance of the Observed Information 

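Lemma 1. For any subset model $\mathcal{M}$,

(1) if $\mathcal{M} \supset \mathcal{M}_T$, then $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) = O_p(n)$ and $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) = O_p(n)$. More specifically,

$$
\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})/n - \mathcal{I}_n(\alpha^*_{\mathcal{M}})/n \overset{P}{\longrightarrow} 0, \quad \text{and} \quad \mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/n - \mathcal{I}_n(\beta^*_{\mathcal{M}})/n \overset{P}{\longrightarrow} 0.
$$

(2) if $\mathcal{M} \not\supset \mathcal{M}_T$, then $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) = O_p(n^{\tau_{\mathcal{M}}})$ and $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) = O_p(n^{\tau_{\mathcal{M}}})$, where $0 \le \tau_{\mathcal{M}} \le 1$.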


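Proof. First, we study the asymptotics of MLEs. The assumptions on the design matrix of the
full model $\mathbf{X}$ continue to hold for the design matrix $\mathbf{X}_{\mathcal{M}}$ under all subset models, i.e., $\|\mathbf{x}_{\mathcal{M},i}\|_2$ are
bounded for all $i = 1, \ldots, n$, and as $n$ tends to infinity, the smallest eigenvalue of $\mathbf{X}_{\mathcal{M}}^T\mathbf{X}_{\mathcal{M}}/n$
is bounded from below by a positive constant. Since these are stronger than the condition
$R_c$ in Fahrmeir and Kaufmann (1985, pp. 355), we have weak consistency and asymptotic
normality for MLEs under any $\mathcal{M} \supset \mathcal{M}_T$, i.e., as $n \to \infty$,

$$
\begin{pmatrix} \hat{\alpha}_{\mathcal{M}} \\ \hat{\beta}_{\mathcal{M}} \end{pmatrix}
\overset{P}{\longrightarrow}
\begin{pmatrix} \alpha^*_{\mathcal{M}} \\ \beta^*_{\mathcal{M}} \end{pmatrix},
\tag{A.1}
$$

and the MLEs, after normalization by the square root of the expected information, converge in
distribution to $N(0, \mathbf{I}_{p_{\mathcal{M}}+1})$.

Here, $\alpha^*_{\mathcal{M}} = \alpha^*_{\mathcal{M}_T}$ and $\beta^*_{\mathcal{M}} = \beta^*_{\mathcal{M}_T}$ in the sense that all entries in $\beta^*_{\mathcal{M}}$ that correspond
to predictors not in $\mathcal{M}_T$ are filled with zero. Therefore, if $\mathcal{M} \supset \mathcal{M}_T$, then
$\eta^*_{\mathcal{M},i} = \alpha^*_{\mathcal{M}} + \mathbf{x}_{\mathcal{M},i}^T \beta^*_{\mathcal{M}} = \alpha^*_{\mathcal{M}_T} + \mathbf{x}_{\mathcal{M}_T,i}^T \beta^*_{\mathcal{M}_T} = \eta^*_{\mathcal{M}_T,i}$, for all $i = 1, \ldots, n$. On the other hand, if $\mathcal{M} \not\supset \mathcal{M}_T$,
Self and Mauritsen (1988) and van der Vaart (2000, pp. 45, Theorem 5.7) suggest that the
limits of MLEs still exist, i.e., $(\hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}) \overset{P}{\longrightarrow} (\alpha^*_{\mathcal{M}}, \beta^*_{\mathcal{M}})$, but the linear predictors in the
limit, $\eta^*_{\mathcal{M},i}$, need not equal $\eta^*_{\mathcal{M}_T,i}$.

Since under non-canonical links, observed information matrices contain $Y$, we need a weak
law of large numbers for independent but non-identically distributed random variables. In
Resnick (1999, pp. 205), by Theorem 7.2.1 and the proof of special case (a), we have that for
a sequence of independent random variables $Y_1, \ldots, Y_n$, if their variances are bounded, then
as $n \to \infty$,

$$
\frac{1}{n} \sum_{i=1}^{n} Y_i - \frac{1}{n} \sum_{i=1}^{n} E(Y_i) \overset{P}{\longrightarrow} 0.
\tag{A.2}
$$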


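Then we show the asymptotic results on $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})$. For $i = 1, \ldots, n$, the $i$th diagonal
entry of $\mathcal{J}_n(\hat{\eta}_{\mathcal{M}})$ can be rewritten as
$d_i = b'' \circ \theta(\hat{\eta}_{\mathcal{M},i}) \left[\theta'(\hat{\eta}_{\mathcal{M},i})\right]^2 + \left[b' \circ \theta(\hat{\eta}_{\mathcal{M},i}) - Y_i\right] \theta''(\hat{\eta}_{\mathcal{M},i})$.

Hence, for any model $\mathcal{M}$,

$$
\begin{aligned}
\frac{1}{n} \mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) &= \frac{1}{n} \mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n = \frac{1}{n} \sum_{i=1}^{n} d_i \\
&= \frac{1}{n} \sum_{i=1}^{n} \left\{ b'' \circ \theta(\hat{\eta}_{\mathcal{M},i}) \left[\theta'(\hat{\eta}_{\mathcal{M},i})\right]^2 + \left[b' \circ \theta(\hat{\eta}_{\mathcal{M},i}) - Y_i\right] \theta''(\hat{\eta}_{\mathcal{M},i}) \right\} \\
&\overset{P}{\longrightarrow} \frac{1}{n} \sum_{i=1}^{n} \left\{ b'' \circ \theta(\hat{\eta}_{\mathcal{M},i}) \left[\theta'(\hat{\eta}_{\mathcal{M},i})\right]^2 + \left[b' \circ \theta(\hat{\eta}_{\mathcal{M},i}) - b' \circ \theta(\eta^*_{\mathcal{M}_T,i})\right] \theta''(\hat{\eta}_{\mathcal{M},i}) \right\} \\
&\overset{P}{\longrightarrow} \frac{1}{n} \sum_{i=1}^{n} \left\{ b'' \circ \theta(\eta^*_{\mathcal{M},i}) \left[\theta'(\eta^*_{\mathcal{M},i})\right]^2 + \left[b' \circ \theta(\eta^*_{\mathcal{M},i}) - b' \circ \theta(\eta^*_{\mathcal{M}_T,i})\right] \theta''(\eta^*_{\mathcal{M},i}) \right\},
\end{aligned}
\tag{A.3}
$$

where the second last line is given by (A.2) and the fact $E(Y_i) = b' \circ \theta(\eta^*_{\mathcal{M}_T,i})$, for all
$i = 1, \ldots, n$, and the last line is given by the continuous mapping theorem. Since for all
$i = 1, \ldots, n$, $\mathbf{x}_i$ is bounded, $\eta^*_{\mathcal{M},i}$ and $\eta^*_{\mathcal{M}_T,i}$ are also bounded. Each term in the summation
of (A.3) is bounded due to the continuity assumptions on the third derivatives of $b(\cdot)$ and
$\theta(\cdot)$. Therefore, $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})/n$ is bounded in probability.

If $\mathcal{M} \supset \mathcal{M}_T$, (A.3) becomes

$$
\frac{1}{n} \mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) \overset{P}{\longrightarrow} \frac{1}{n} \sum_{i=1}^{n} b'' \circ \theta(\eta^*_{\mathcal{M}_T,i}) \left[\theta'(\eta^*_{\mathcal{M}_T,i})\right]^2,
\tag{A.4}
$$

which is also the limit of $\mathcal{I}_n(\alpha^*_{\mathcal{M}})/n$. Because we assume that $b' \circ \theta(\cdot)$ is strictly monotonic,
$\theta(\cdot)$ is also strictly monotonic. Each term in the summation of (A.4) is positive
because $\theta'(\cdot) \ne 0$ and $b'' \circ \theta(\eta)$ is positive for finite $\eta$. Therefore by (A.4), if $\mathcal{M} \supset \mathcal{M}_T$, then
$\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})/n$ is positive and bounded in probability, i.e., $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) = O_p(n)$. On the other hand,
if $\mathcal{M} \not\supset \mathcal{M}_T$, then only (A.3) holds but not (A.4). Each term in the summation of (A.3)
can be positive, zero, or negative. In this case, by (A.3), $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})/n$ is bounded in
probability, and it may equal zero. Therefore, $\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})$ is on the order of $O_p(n^{\tau_{\mathcal{M}}})$, where
$\tau_{\mathcal{M}} \le 1$, so that it tends to $\infty$ at a rate no faster than $O_p(n)$.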

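Last, we show the asymptotic results of the matrix

$$
\mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) = \mathbf{X}_{\mathcal{M}}^T (\mathbf{I}_n - \mathcal{P}_{\mathbf{1}_n})^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) (\mathbf{I}_n - \mathcal{P}_{\mathbf{1}_n}) \mathbf{X}_{\mathcal{M}}
= \mathbf{X}_{\mathcal{M}}^T \left[ \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) - \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n \left(\mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{1}_n\right)^{-1} \mathbf{1}_n^T \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \right] \mathbf{X}_{\mathcal{M}}.
$$

For the $(j, k)$th entry, $1 \le j \le k \le p_{\mathcal{M}}$,

$$
\frac{1}{n} \left[\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})\right]_{jk}
= \frac{1}{n} \sum_{i=1}^{n} d_i x_{ij} x_{ik}
- \left( \frac{1}{n} \sum_{i=1}^{n} d_i x_{ij} \right) \left( \frac{1}{n} \sum_{i=1}^{n} d_i \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} d_i x_{ik} \right)
$$

is bounded since all $\mathbf{x}_i$ are bounded. Therefore, $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/n$ is bounded in probability.

To show that for any $\mathcal{M} \supset \mathcal{M}_T$, $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/n$ does not reduce to zero, we will show that
it is a positive definite matrix. For any given non-zero vector $\mathbf{a} \in \mathbb{R}^{p_{\mathcal{M}}}$, we denote
$\mathbf{X}_{\mathcal{M}} \mathbf{a} = (t_1, \ldots, t_n)^T$, whose entries are all bounded. When $\mathcal{M} \supset \mathcal{M}_T$, by (A.4), all $d_i$'s have a positive
lower bound, hence simple calculation gives

$$
\frac{1}{n} \mathbf{a}^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) \mathbf{a}
= \frac{1}{n} \sum_{i=1}^{n} d_i t_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} d_i \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} d_i t_i \right)^2 \ge 0.
$$

Here the equality only holds if all $t_i$'s are equal for $i = 1, \ldots, n$, which is impossible here because
of the assumption $\mathbf{1}_n \notin C(\mathbf{X}_{\mathcal{M}})$. For large $n$, the assumption that the smallest eigenvalue of
$\mathbf{X}_{\mathcal{M}}^T\mathbf{X}_{\mathcal{M}}/n$ is bounded from below by a positive constant suggests that $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/n$ is positive
definite, so $\mathbf{a}^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) \mathbf{a}/n > 0$.

Furthermore, arguing similarly to (A.4), we also have

$$
\frac{1}{n} \sum_{i=1}^{n} d_i t_i^k - \frac{1}{n} \sum_{i=1}^{n} b'' \circ \theta(\eta^*_{\mathcal{M}_T,i}) \left[\theta'(\eta^*_{\mathcal{M}_T,i})\right]^2 t_i^k \overset{P}{\longrightarrow} 0,
$$

for $k = 0, 1, 2$. Therefore, for any vector $\mathbf{a}$, if $\mathcal{M} \supset \mathcal{M}_T$, then

$$
\frac{1}{n} \mathbf{a}^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) \mathbf{a} - \frac{1}{n} \mathbf{a}^T \mathcal{I}_n(\beta^*_{\mathcal{M}}) \mathbf{a} \overset{P}{\longrightarrow} 0,
$$

i.e., $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/n$ and $\mathcal{I}_n(\beta^*_{\mathcal{M}})/n$ are asymptotically the same. □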

A.4 Proof of Proposition 

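To show the Proposition about the prior precision $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}})/g$, we first examine the behavior of
$\mathcal{J}_n(\hat{\eta}_{\mathcal{M}})$ in the following lemma. Recall that $\mathcal{J}_n(\hat{\eta}_{\mathcal{M}})$ is a diagonal matrix, whose $i$th diagonal
entry is $d(\hat{\eta}_{\mathcal{M},i})$, where the function $d(\cdot)$ is defined as

$$
d(\eta) = -\frac{\partial^2 \log f(Y \mid \eta)}{\partial \eta^2}.
$$

Lemma 2. For both logistic and probit regression models, if $\eta$ tends to $\infty$ or $-\infty$ (when the
corresponding $Y$ is 1 or 0, respectively), then $d(\eta) \to 0$. In addition, $d(\eta) = O\left(e^{-|\eta|}\right)$ for
logistic regression, and $d(\eta) = O\left(|\eta| e^{-\eta^2/2}\right)$ for probit regression.

Proof. For logistic regression,

$$
d(\eta) = \frac{\exp(\eta)}{\left[1 + \exp(\eta)\right]^2}.
$$

Hence, as $\eta$ tends to $\infty$ or $-\infty$, $d(\eta) \to 0$, at the rate $O\left(e^{-|\eta|}\right)$.
For probit regression,

$$
d(\eta) = \frac{\left[\Phi(\eta) - Y\right] \left\{ \phi'(\eta) \Phi(\eta) \left[1 - \Phi(\eta)\right] - \phi(\eta)^2 \left[1 - 2\Phi(\eta)\right] \right\} + \phi(\eta)^2 \Phi(\eta) \left[1 - \Phi(\eta)\right]}{\Phi(\eta)^2 \left[1 - \Phi(\eta)\right]^2},
$$

where $\Phi(\cdot)$ and $\phi(\cdot)$ are the cdf and pdf of the standard normal distribution, respectively. If
$\eta \to \infty$, then $\Phi(\eta) \to 1$, $Y = 1$, and

$$
d(\eta) = -\frac{\phi'(\eta)}{\Phi(\eta)} + \frac{\left[1 - 2\Phi(\eta)\right] \phi(\eta)^2}{\Phi(\eta)^2 \left[1 - \Phi(\eta)\right]} + \frac{\phi(\eta)^2}{\Phi(\eta) \left[1 - \Phi(\eta)\right]}
= -\frac{\phi'(\eta)}{\Phi(\eta)} + \frac{\phi(\eta)^2}{\Phi(\eta)^2} \propto \eta e^{-\eta^2/2}.
$$

If $\eta \to -\infty$, then $\Phi(\eta) \to 0$, $Y = 0$, and

$$
d(\eta) = \frac{\phi'(\eta)}{1 - \Phi(\eta)} - \frac{\left[1 - 2\Phi(\eta)\right] \phi(\eta)^2}{\Phi(\eta) \left[1 - \Phi(\eta)\right]^2} + \frac{\phi(\eta)^2}{\Phi(\eta) \left[1 - \Phi(\eta)\right]}
= \frac{\phi'(\eta)}{1 - \Phi(\eta)} + \frac{\phi(\eta)^2}{\left[1 - \Phi(\eta)\right]^2} \propto -\eta e^{-\eta^2/2}.
$$

Therefore, as $\eta$ tends to $\infty$ or $-\infty$, $d(\eta) \to 0$, at the rate $O\left(|\eta| e^{-\eta^2/2}\right)$.

□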
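
The decay rates in Lemma 2 can be checked numerically. The sketch below (Python, standard library only; function names are illustrative) evaluates $d(\eta)$ for the logistic and probit links using the simplified one-sided expressions from the proof, and compares them with the stated leading terms $e^{-|\eta|}$ and $|\eta|\,\phi(\eta)$, where $\phi(\eta) \propto e^{-\eta^2/2}$.

```python
from math import erf, exp, pi, sqrt


def d_logistic(eta):
    """Observed information exp(eta) / (1 + exp(eta))^2 under the logit link.
    Written in terms of exp(-|eta|), which is equivalent and avoids overflow."""
    z = exp(-abs(eta))
    return z / (1.0 + z) ** 2


def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def d_probit(eta, y):
    """-(d^2 / d eta^2) log f(y | eta) under the probit link, for y in {0, 1}."""
    phi, Phi = norm_pdf(eta), norm_cdf(eta)
    dphi = -eta * phi                      # derivative of the normal pdf
    if y == 1:
        return -dphi / Phi + (phi / Phi) ** 2
    return dphi / (1.0 - Phi) + (phi / (1.0 - Phi)) ** 2


# Ratios to the leading terms stated in Lemma 2; both approach one as |eta| grows.
logistic_ratio = d_logistic(20.0) * exp(20.0)
probit_ratio = d_probit(6.0, 1) / (6.0 * norm_pdf(6.0))
print(d_logistic(0.0), logistic_ratio)
print(d_probit(2.0, 1), d_probit(6.0, 1), probit_ratio)
```

The probit information vanishes much faster than the logistic one, in line with the Gaussian versus exponential rates in the lemma.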


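Now we give a proof of the Proposition.

Proof. For quasicomplete separation, we denote $C = \{1, \ldots, n\} \setminus Q$ as the observations that
can be completely separated, i.e., there exists a $(\gamma_0, \gamma)$ pair that satisfies strict inequalities in
(22) for all samples in $C$. For complete separation, $C = \{1, \ldots, n\}$, so we say that $Q = \varnothing$. As
to MLEs of linear predictors, by Albert and Anderson (1984, Theorem 1, 2),

$$
\hat{\eta}_{\mathcal{M},i} \to \infty \text{ if } Y_i = 1, \qquad \hat{\eta}_{\mathcal{M},i} \to -\infty \text{ if } Y_i = 0, \qquad \text{for all } i \in C.
$$

By Lemma 2, diagonal entries of $\mathcal{J}_n(\hat{\eta}_{\mathcal{M}})$ satisfy

$$
d_i = d(\hat{\eta}_{\mathcal{M},i}) \to 0, \qquad \text{for all } i \in C.
$$

In contrast, when $Q$ is non-empty, there exist finite and unique $\hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}$ such that

$$
\hat{\eta}_{\mathcal{M},i} = \hat{\alpha}_{\mathcal{M}} + \mathbf{x}_{\mathcal{M},i}^T \hat{\beta}_{\mathcal{M}}, \qquad i \in Q.
\tag{A.5}
$$

Hence, both $\hat{\eta}_{\mathcal{M},i}$ and $d_i$ are bounded for $i \in Q$.

We first consider the $g$-prior precision matrix under complete separation, where the
centering step (12) can be problematic. Although the original design matrix $\mathbf{X}_{\mathcal{M}}$ is fixed, in the
resulting centered design matrix $\mathbf{X}^c_{\mathcal{M}}$, its $j$th column

$$
\mathbf{X}^c_j = \mathbf{X}_j - \left( \sum_{i=1}^{n} \frac{d_i}{\sum_{r=1}^{n} d_r} x_{ij} \right) \mathbf{1}_n
$$

is not unique, because the ratio $d_i / (\sum_{r=1}^{n} d_r)$ may take different values inside the interval
$[0, 1]$, depending on which direction the coefficient vector diverges to infinity. Despite this
non-uniqueness, for complete separation, we can show that in the $g$-prior precision matrix,
all entries of $\mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) = \mathbf{X}^{cT}_{\mathcal{M}} \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}^c_{\mathcal{M}}$ converge to zero. For $j = 1, \ldots, p_{\mathcal{M}}$, suppose that
$\bar{x}_j = \max_i |x_{ij}|$, then $|x^c_{ij}| \le 2\bar{x}_j$. For the $(j, k)$-th entry of the precision, since $\bar{x}_j, \bar{x}_k$ are
finite, we just need to show an upper bound of its absolute value converges to zero:

$$
\left| \mathbf{X}^{cT}_j \mathcal{J}_n(\hat{\eta}_{\mathcal{M}}) \mathbf{X}^c_k \right| \le \sum_{i=1}^{n} d_i \left|x^c_{ij}\right| \left|x^c_{ik}\right| \le \sum_{i=1}^{n} d_i (2\bar{x}_j)(2\bar{x}_k) \to 0.
$$

We next consider the $g$-prior precision matrix under quasicomplete separation. By (A.5),
$\{d_i : i \in Q\}$ are bounded from below by a positive value. So the centering step (12) is
well defined, because the ratio $d_i / (\sum_{r=1}^{n} d_r)$ is zero for $i \in C$, and is positive (non-zero)
for $i \in Q$. Denote by $\mathcal{J}_n(\hat{\eta}_{\mathcal{M},Q})$ the diagonal matrix formed by $\{d_i : i \in Q\}$. Following the
above results about complete separation, for quasicomplete separation, we have
$\mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) = \mathbf{X}^{cT}_{\mathcal{M},Q} \mathcal{J}_n(\hat{\eta}_{\mathcal{M},Q}) \mathbf{X}^c_{\mathcal{M},Q}$, whose rank is the same as $q_{\mathcal{M}} = \mathrm{rank}(\mathbf{X}^c_{\mathcal{M},Q})$.

Last, we investigate the Bayes factor under quasicomplete separation. Since
$\mathcal{J}_n(\hat{\alpha}_{\mathcal{M}}) = \sum_{i=1}^{n} d_i$ is positive, the behavior of the Wald statistic

$$
Q_{\mathcal{M}} = \hat{\beta}_{\mathcal{M}}^T \mathcal{J}_n(\hat{\beta}_{\mathcal{M}}) \hat{\beta}_{\mathcal{M}} = \sum_{i=1}^{n} d_i \left( \mathbf{x}^{cT}_{\mathcal{M},i} \hat{\beta}_{\mathcal{M}} \right)^2
\tag{A.6}
$$

determines that of the Bayes factor (19). For each $i \in Q$, (A.5) suggests that $\hat{\eta}_{\mathcal{M},i}$ is bounded,
hence,

$$
\mathbf{x}^{cT}_{\mathcal{M},i} \hat{\beta}_{\mathcal{M}} = \hat{\eta}_{\mathcal{M},i} - \frac{\sum_{i' \in Q} d_{i'} \hat{\eta}_{\mathcal{M},i'}}{\sum_{i' \in Q} d_{i'}}
$$

is also bounded. Therefore,

$$
Q_{\mathcal{M}} = \sum_{i \in C} d(\hat{\eta}_{\mathcal{M},i}) \left( \mathbf{x}^{cT}_{\mathcal{M},i} \hat{\beta}_{\mathcal{M}} \right)^2 + \sum_{i \in Q} d(\hat{\eta}_{\mathcal{M},i}) \left( \mathbf{x}^{cT}_{\mathcal{M},i} \hat{\beta}_{\mathcal{M}} \right)^2
= \sum_{i \in Q} d(\hat{\eta}_{\mathcal{M},i}) \left( \mathbf{x}^{cT}_{\mathcal{M},i} \hat{\beta}_{\mathcal{M}} \right)^2
$$

is bounded, which leads to the conclusion that $\mathrm{BF}_{\mathcal{M}:\mathcal{M}_\varnothing}$ is bounded in the limit.

□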

A.5 Proof of Proposition 

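Proof. We first use proof by contradiction to show that for $\mathcal{M}$, the MLE of the intercept is
unique. If both $(\hat{\alpha}_1, \hat{\beta}_1)$ and $(\hat{\alpha}_2, \hat{\beta}_2)$ maximize the likelihood for model $\mathcal{M}$, where $\hat{\alpha}_1 \ne \hat{\alpha}_2$,
then

$$
\hat{\alpha}_1 \mathbf{1}_n + \mathbf{X}_{\mathcal{M}} \hat{\beta}_1 = \hat{\alpha}_2 \mathbf{1}_n + \mathbf{X}_{\mathcal{M}} \hat{\beta}_2
\implies (\hat{\alpha}_1 - \hat{\alpha}_2) \mathbf{1}_n = \mathbf{X}_{\mathcal{M}} (\hat{\beta}_2 - \hat{\beta}_1),
$$

which contradicts $\mathbf{1}_n \notin C(\mathbf{X}_{\mathcal{M}})$. Similarly, we can show this MLE is the same as the
one for model $\mathcal{M}'$, i.e., $\hat{\alpha}_{\mathcal{M}} = \hat{\alpha}_{\mathcal{M}'}$.

By (23), between the two models $\mathcal{M}$ and $\mathcal{M}'$,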




So we jnst need to show Qm = Qm'- Since (Xm = (23) snggests that 


M ' 1 ^ 1) ■ • • 5 


n. 


Hence, 


/ ” T " ^ 


, 2=1 


where Wi = dr). Therefore, we have 


Qm — 




t7n(f7yV() 










— Qm' 


□ 


A.6 Proof of Proposition 

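Proof. The marginal likelihood of the mixture of $g$-priors is obtained by integrating out $g$
from the marginal likelihood of the $g$-prior, i.e.,

$$
p(Y \mid \mathcal{M}) = \int_0^\infty p(Y \mid \mathcal{M}, g)\, p(g)\, dg.
$$

Here $p(Y \mid \mathcal{M}, g)$ is obtained under the integrated Laplace approximation as in (17). Because
of the one-to-one mapping between $g$ and $u$, we rewrite this integral in terms of $u$,

$$
\begin{aligned}
p(Y \mid \mathcal{M}) &= \int_0^{1/v} p(Y \mid \mathcal{M}, u)\, p(u)\, du \\
&\propto \int_0^{1/v} p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})\, \mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})^{-\frac{1}{2}}\, u^{\frac{p_{\mathcal{M}}}{2}} e^{-\frac{Q_{\mathcal{M}} u}{2}}
\cdot \frac{v^{\frac{a}{2}} \exp\left(\frac{s}{2v}\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a+b}{2}, \frac{s}{2v}, 1 - \kappa\right)}
\cdot \frac{u^{\frac{a}{2} - 1} (1 - vu)^{\frac{b}{2} - 1} e^{-\frac{su}{2}}}{\left[\kappa + (1 - \kappa) vu\right]^r}\, du \\
&= p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})\, \mathcal{J}_n(\hat{\alpha}_{\mathcal{M}})^{-\frac{1}{2}}
\frac{v^{\frac{a}{2}} \exp\left(\frac{s}{2v}\right)}{B\left(\frac{a}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a+b}{2}, \frac{s}{2v}, 1 - \kappa\right)}
\int_0^{1/v} \frac{u^{\frac{a + p_{\mathcal{M}}}{2} - 1} (1 - vu)^{\frac{b}{2} - 1} e^{-\frac{(s + Q_{\mathcal{M}}) u}{2}}}{\left[\kappa + (1 - \kappa) vu\right]^r}\, du.
\end{aligned}
$$

Since the above integrand is proportional to a tCCH density with updated parameters,
the above integral equals

$$
B\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{b}{2}\right) \Phi_1\left(\frac{b}{2}, r, \frac{a + b + p_{\mathcal{M}}}{2}, \frac{s + Q_{\mathcal{M}}}{2v}, 1 - \kappa\right) v^{-\frac{a + p_{\mathcal{M}}}{2}} \exp\left(-\frac{s + Q_{\mathcal{M}}}{2v}\right).
$$

□

A.7 Proof of Proposition

Proof. The marginal prior on $\beta_{\mathcal{M}}$ after integrating $g$ out is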


p{(3m I Af) oc / g exp 


^9 


1-1 


g+b 

1 \ — 


^ + 9 


exp 


S9 


W + 9). 


dg (A.7) 


We will show that as $\|\beta_{\mathcal{M}}\|_{\mathcal{J}_n} \to \infty$, both a lower bound and an upper bound of (A.7) are
proportional to $(\|\beta_{\mathcal{M}}\|^2_{\mathcal{J}_n})^{-(a+p_{\mathcal{M}})/2}$. Since $s > 0$, a lower bound of the right side of (A.7) is


9 2 e 29 ^2 

/o \^ + 9 


g+b 

2 


dg = 


a + b ^ ^ a+PM-2 „ 

/ 1 . 

1 + d/ \9j \9, 


9 


Then according to Watson's Lemma (Olver 1997, pp. 71), as $\|\beta_{\mathcal{M}}\|_{\mathcal{J}_n} \to \infty$, the limit of
this lower bound is proportional to $(\|\beta_{\mathcal{M}}\|^2_{\mathcal{J}_n})^{-(a+p_{\mathcal{M}})/2}$. Next we find an upper bound of the
right side of (A.7) as

P.M 

g 2 exp 




2-1 


2{1 + g) j \1 + g 


<2 + 6 

1 \ ^ 


exp 


sg 


W + 9)\ 


dg 


_ a + PM ^^ a + b s+||/3^||^^ 


According to Abramowitz and Stegun (1970) formula (13.1.4),

$${}_1F_1(a, b, s) = \frac{\Gamma(b)}{\Gamma(a)} \exp(s)\, s^{a-b} \left[1 + O(|s|^{-1})\right], \quad \text{when } \mathrm{Real}(s) > 0, \tag{A.8}$$

hence as $\|\beta_{\mathcal{M}}\|_{\mathcal{J}_n} \to \infty$, the above upper bound converges to


exp 


11 / 3 , 


Mlijn 


I a + Pm , 

1 I -;- I exp 


S + ll/3xllL 


s+\\^m\\%^ 


+PM 


°-+PM 

« (II/3a^IIL) ' 


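Therefore, as $\|\beta_{\mathcal{M}}\|_{\mathcal{J}_n}$ increases, or equivalently, as $\|\beta_{\mathcal{M}}\|$ increases, both the lower bound
and upper bound of $p(\beta_{\mathcal{M}} \mid \mathcal{M})$ are proportional to $(\|\beta_{\mathcal{M}}\|^2_{\mathcal{J}_n})^{-(a+p_{\mathcal{M}})/2}$. □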
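
The large-argument expansion (A.8) used for the upper bound can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the helper names `hyp1f1_series` and `leading_term` and the parameter values are hypothetical, and ${}_1F_1$ is summed from its power series rather than taken from a library. It shows the ratio of ${}_1F_1(a, b, s)$ to the leading term $\Gamma(b)/\Gamma(a)\, e^{s} s^{a-b}$ approaching one as $s$ grows.

```python
import math


def hyp1f1_series(a, b, x, max_terms=2000, tol=1e-15):
    """Kummer's 1F1(a, b, x) summed from its power series (adequate for moderate |x|)."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) / (b + k) * x / (k + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def leading_term(a, b, s):
    """Leading term Gamma(b)/Gamma(a) * exp(s) * s**(a - b) of (A.8), for real s > 0."""
    return math.exp(math.lgamma(b) - math.lgamma(a) + s + (a - b) * math.log(s))


# Illustrative (hypothetical) parameter values.
a, b = 1.5, 4.0
ratios = {s: hyp1f1_series(a, b, s) / leading_term(a, b, s) for s in (20.0, 40.0, 80.0)}
for s, r in ratios.items():
    print(f"s = {s:5.1f}   1F1 / leading term = {r:.4f}")
```

The gap between the ratio and one shrinks roughly in proportion to $1/s$, in line with the $O(|s|^{-1})$ remainder in (A.8).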


A.8 Special Functions: Definition and Useful Properties 

We first review a list of special functions, including their definitions and relevant properties,
that will be needed in the proof of Proposition


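Confluent hypergeometric function (Abramowitz and Stegun 1970, eq 13.2.1): for $\gamma > \alpha > 0$,

$${}_1F_1(\alpha, \gamma, x) = \frac{1}{B(\gamma - \alpha, \alpha)} \int_0^1 u^{\alpha - 1} (1 - u)^{\gamma - \alpha - 1} \exp(xu)\, du.$$

- By (Abramowitz and Stegun 1970, eq 13.2.27): ${}_1F_1(\alpha, \gamma, x) = \exp(x) \cdot {}_1F_1(\gamma - \alpha, \gamma, -x)$.

- By (Abramowitz and Stegun 1970, eq 6.5.12), the incomplete Gamma function:

$$\gamma(a, s) = \int_0^s t^{a-1} e^{-t}\, dt = {}_1F_1(a, a + 1, -s)\, \frac{s^a}{a}.$$

- ${}_1F_1(\alpha, \gamma, 0) = 1$.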
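
The properties of ${}_1F_1$ listed above can be verified numerically. The sketch below is a hedged illustration, not part of the original derivation: the function names and the parameter values are hypothetical, the series and Simpson-rule helpers are written from scratch, and the integral check assumes $\alpha \geq 1$ and $\gamma - \alpha \geq 1$ so that the integrand is bounded.

```python
import math


def beta_fn(p, q):
    """Beta function B(p, q) computed through log-Gamma."""
    return math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))


def hyp1f1_series(alpha, gamma, x, max_terms=500, tol=1e-16):
    """1F1(alpha, gamma, x) from its power series."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (alpha + k) / (gamma + k) * x / (k + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def hyp1f1_integral(alpha, gamma, x, n=2000):
    """1F1 from the integral representation, by the composite Simpson rule (n even)."""
    def f(u):
        return u ** (alpha - 1) * (1 - u) ** (gamma - alpha - 1) * math.exp(x * u)

    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3 / beta_fn(gamma - alpha, alpha)


alpha, gamma, x = 2.0, 5.0, 1.5            # illustrative values
series_val = hyp1f1_series(alpha, gamma, x)
integral_val = hyp1f1_integral(alpha, gamma, x)
kummer_val = math.exp(x) * hyp1f1_series(gamma - alpha, gamma, -x)

# Incomplete Gamma check with a = 3, where the closed form
# gamma(3, s) = 2 - exp(-s) * (s**2 + 2*s + 2) is available.
s = 1.2
gamma_via_1f1 = hyp1f1_series(3.0, 4.0, -s) * s ** 3 / 3.0
gamma_closed = 2.0 - math.exp(-s) * (s * s + 2 * s + 2)

print(series_val, integral_val, kummer_val)
print(gamma_via_1f1, gamma_closed)
```

Each pair of printed values should agree to many digits, which confirms the integral representation, the transformation ${}_1F_1(\alpha, \gamma, x) = e^{x}\, {}_1F_1(\gamma - \alpha, \gamma, -x)$, and the incomplete Gamma identity for these inputs.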


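Confluent hypergeometric function of two variables (Gordy 1998b): for $\gamma > \alpha > 0$ and
$y < 1$,

$$\Phi_1(\alpha, \beta, \gamma, x, y) = \frac{1}{B(\gamma - \alpha, \alpha)} \int_0^1 u^{\alpha - 1} (1 - u)^{\gamma - \alpha - 1} (1 - yu)^{-\beta} e^{xu}\, du.$$

Special cases:

- If $x = 0$, then $\Phi_1(\alpha, \beta, \gamma, 0, y) = {}_2F_1(\beta, \alpha; \gamma; y)$.

- If $\beta = 0$ or $y = 0$, then $\Phi_1(\alpha, 0, \gamma, x, y) = \Phi_1(\alpha, \beta, \gamma, x, 0) = \Phi_1(\alpha, 0, \gamma, x, 0) = {}_1F_1(\alpha, \gamma, x)$.

- If $x = 0$ and $y = 0$, then $\Phi_1(\alpha, \beta, \gamma, 0, 0) = 1$.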


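Hypergeometric function (Abramowitz and Stegun 1970, eq 15.3.1): for $\gamma > \alpha > 0$,

$${}_2F_1(\beta, \alpha; \gamma; x) = \frac{1}{B(\gamma - \alpha, \alpha)} \int_0^1 u^{\alpha - 1} (1 - u)^{\gamma - \alpha - 1} (1 - xu)^{-\beta}\, du.$$

- By (Abramowitz and Stegun 1970, eq 15.3.3): in the definition of the ${}_2F_1$ function above,
let $w = \frac{1 - u}{1 - xu}$, then

$${}_2F_1(\beta, \alpha; \gamma; x) = (1 - x)^{\gamma - \alpha - \beta}\, {}_2F_1(\gamma - \beta, \gamma - \alpha; \gamma; x).$$

  - ${}_2F_1(0, \alpha; \gamma; x) = {}_2F_1(\beta, \alpha; \gamma; 0) = 1$
  - ${}_2F_1(\beta, 1; \beta; x) = (1 - x)^{-1}\, {}_2F_1(0, \beta - 1; \beta; x) = (1 - x)^{-1}$

- By (Abramowitz and Stegun 1970, eq 15.3.4): ${}_2F_1(\beta, \alpha; \gamma; x) = (1 - x)^{-\beta}\, {}_2F_1\left(\beta, \gamma - \alpha; \gamma; \frac{x}{x - 1}\right)$.

- By (Abramowitz and Stegun 1970, eq 15.3.5): ${}_2F_1(\beta, \alpha; \gamma; x) = (1 - x)^{-\alpha}\, {}_2F_1\left(\alpha, \gamma - \beta; \gamma; \frac{x}{x - 1}\right)$.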
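
The three linear transformations of ${}_2F_1$ quoted above (eqs 15.3.3 to 15.3.5) can be sanity-checked with a few lines of Python. This is a minimal sketch under stated assumptions: `hyp2f1_series` is a hypothetical helper that sums the Gauss series directly, so it is only valid for $|x| < 1$, and the parameter values are arbitrary illustrations.

```python
def hyp2f1_series(a, b, c, x, max_terms=5000, tol=1e-16):
    """Gauss series for 2F1(a, b; c; x); converges for |x| < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / (c + k) * x / (k + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


# Notation follows the text: 2F1(beta, alpha; gamma; x). Values are illustrative.
beta, alpha, gamma, x = 1.3, 0.7, 2.9, 0.3
x_t = x / (x - 1.0)   # transformed argument, here about -0.43

direct = hyp2f1_series(beta, alpha, gamma, x)
eq_15_3_3 = (1 - x) ** (gamma - alpha - beta) * hyp2f1_series(gamma - beta, gamma - alpha, gamma, x)
eq_15_3_4 = (1 - x) ** (-beta) * hyp2f1_series(beta, gamma - alpha, gamma, x_t)
eq_15_3_5 = (1 - x) ** (-alpha) * hyp2f1_series(alpha, gamma - beta, gamma, x_t)

# Special case listed in the text: 2F1(beta, 1; beta; x) = 1 / (1 - x).
special = hyp2f1_series(beta, 1.0, beta, x)

print(direct, eq_15_3_3, eq_15_3_4, eq_15_3_5)
print(special, 1.0 / (1.0 - x))
```

All four values on the first printed line should coincide up to rounding, as should the two values on the second line.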


Note: the definition in Gordy (1998b) is slightly different from that in Gradshteyn and Ryzhik (2007).

























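Hypergeometric function of two variables (Appell function) (Weisstein 2009): for $\gamma > \alpha > 0$,

$$F_1(\alpha; \beta, \beta'; \gamma; x, y) = \frac{1}{B(\gamma - \alpha, \alpha)} \int_0^1 u^{\alpha - 1} (1 - u)^{\gamma - \alpha - 1} (1 - xu)^{-\beta} (1 - yu)^{-\beta'}\, du.$$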


A.9 Proof of Proposition 

Proof. For simplicity, in this proof we omit the subscript $\mathcal{M}$ when there is no ambiguity.
We first show part (1). In the tCCH distribution, if $r = 0$ or $\kappa = 1$, then


$1 



a + h s 



$1 



a + h 
2 



lAi 


h a + h s\ 
2 ’ 2 ■ 


Then the marginal likelihood becomes 


p(Y I Af) = 


p(Y I At 0 ) nz exp (^) "2 

lU (|. T'. i) io 1(1 - R2) + R2u]’^ 


du 


p(Y I At 0 ) nz exp (^) 

TD (a b\ TP (b a+b s N / 

^ U’ 2 ) 1^1 V 2 ’ do 


—I /-. — 1 — — 

2 (1 — vu)^ e 2 


[1 - (1 - ;) 


l-i?2 


+ 


R2/v 




pjY I Af 0 ) nz exp (£) 

TZ) (a b\ TP ( b < 1 +^ ^ ^ 

V2’ 2 ) 1-^1 V2’ 2 ’ 2i;j 


■q{ 9±P b\ ^ ( b 
V 2 ’ 2t I 2’ 2 ’ 


—1 a+b+p s 


R^/v 


2 ’ 21 ;’ 




n—1 
2 


= P(Y I Af, 


5 llzil 

V 2 ’ 2t \ 2’ 2 ’ 


b n—1 g+fe+p s 


R'^/v 


2 ’ 2 g’ 






n—1 
2 


Here the second last equality is given by the propriety of the tCCH density function (25).
Then we show part (2). In the tCCH distribution, when $s = 0$, then


^ , b a+h 


/ b a + b 
= 2 F 1 I r, 


du 
Hence, the marginal likelihood becomes


piY I M) = 


p(Y I M,) nt 


-l/v 


M2 ^(1 — VU) 2 


S -1 


( 2 ’ 2 ) 2 -P"l 2 j 1 io [(1 — i?2) +/22 m] 2 [/- -g (^1 _ 


du 

(A.9) 


For simplification, we denote $x = 1 - 1/\kappa$ and $w = 1 - (1 - vu)/(1 - xvu)$. By change of
variable,

$$u = \frac{w}{v(1 - x + xw)}, \qquad \frac{du}{dw} = \frac{1 - x}{v(1 - x + xw)^2},$$


and the integral in (A.9) is 


“i/i) 


B+P-l,-. 

M2 (1 — VU) 2 


-1 


'0 [(1 — i 22 ) + i 22 M] 2 [ft; -|- (1 — 


du 


w 

a+p , 

2 

(1—tc)(l—ii>) 

--1 

2 1-01 

v{l—x+xw) 

1 —X+tCtD 

v{l—x+xw)‘^ 

( (1 — ^ 

n — 1 

2 ( ^ y 

1 v{l—x+xw) 

V l—x-\-xw ) 


dw 


b n-l-a-p 

[1 — X)2 V 2 


°+P —1/1 \£ 

ta 2 ^(1 — M!j 2 


--1 


r/-i T-10\ /1 /I ^ a+i)+p+l-Ti-2r ; 

[(1 —/22)m( 1 — x)] 2 (1 — x) 2 Jo 


_ (1-M2 )i,i;+M2 ; 

( 1 -M 2 )j;(a;- 1 ) 


n—1 
2 


(1 - ^1^) 


a+b+p+1—n—2r 
2 


a+p — 2r _ a+p , 

K 2 V ^ I a + p 0 

I 2 ’2 


(1-/22)^ 

[a + p a + h + p + l — n — 2r n — 1 a + h + p ^ (1 —/2^)m(1 — m) — /2^m 

V^~’ 2 ’ 2 ’ ^ (1 - i22)M 


□ 


dw 
A.10 Derivation of (41)


Proof. Similarly, we apply the integrated Laplace approximation to obtain $p(Y \mid \phi, \mathcal{M}, g)$,
then marginalize $\phi$ out as follows.


oc 

oc 


oc 


oc 


p{Y\M,g)= p(Y I 


/ p(Y I [0J'„(dx)] ^{l + g)~~e 2(i+s)0"M0 

Jo 

m&M)]-'- (1 + g)-“ r 


dcj) 


(1 + a)-'^ ' 


2(1+9) 


n 

- ti) - K^i) + b{ti 

i=l 


n — l 
2 




+ 2 X)i=l 11 (fi ~ ^i) ~ b{ti) + &(0j) 


n — l 
2 


Here, the last step replaces $g$ with $u = 1/(1 + g)$.


□ 


A.11 Proof of Model Selection Consistency

We first show a lemma about a non-central $\chi^2$ distribution, which is useful to prove some
of the following lemmas and theorems. Here the symbol $\chi^2_k(m)$ denotes a non-central $\chi^2$
distribution with degrees of freedom $k$ and non-centrality parameter $m$.

Lemma 3. If a sequence of random variables $\{X_n : n = 1, 2, \ldots\}$ have independent non-central
$\chi^2$ distributions: $X_n \mid A_n \sim \chi^2_k(n A_n)$, where random variables $A_n \xrightarrow{p} a_0 \in \mathbb{R}^{+} \cup \{0\}$, then
as $n \to \infty$, $X_n / n \xrightarrow{p} a_0$.

Proof. For any $n \in \mathbb{N}$, the characteristic function of $X_n / n$ evaluated at $t \in \mathbb{R}$ is

$$\phi_{X_n/n}(t) = E_{A_n}\left[E\left(e^{itX_n/n} \mid A_n\right)\right] = E_{A_n}\left[\frac{\exp\left(\frac{itA_n}{1 - 2it/n}\right)}{(1 - 2it/n)^{k/2}}\right] = (1 - 2it/n)^{-k/2} \cdot E_{A_n}\left[\exp\left(\frac{itA_n}{1 - 2it/n}\right)\right].$$

Denote a complex valued random variable $B_n = A_n / (1 - 2it/n)$. Since the limit of $A_n$ is a
constant, for the series $\{A_n : n \in \mathbb{N}\}$, convergence in distribution is equivalent to convergence
in probability. Because of the continuous mapping theorem, $B_n \xrightarrow{p} a_0$, or equivalently,
convergence in distribution. Denote the bounded and continuous function $h(B_n) = \exp(itB_n)$,
then according to the Portmanteau lemma, $E[h(B_n)] \to E[h(a_0)] = h(a_0)$. So for any $t \in \mathbb{R}$,

$$\lim_{n \to \infty} \phi_{X_n/n}(t) = \lim_{n \to \infty} (1 - 2it/n)^{-k/2} \cdot \lim_{n \to \infty} E[h(B_n)] = h(a_0) = \exp(ita_0),$$

where the limit is the characteristic function of a degenerate distribution at $a_0$. Therefore,
$X_n / n$ converges in distribution to a constant $a_0$, which implies convergence in probability.

□
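
Lemma 3 can also be illustrated by simulation. The Python sketch below is a hedged example rather than part of the proof: the sequence $A_n$ (here $a_0$ plus noise of order $n^{-1/2}$, truncated at zero), the values $k = 3$ and $a_0 = 2$, and all function names are hypothetical choices. A non-central $\chi^2_k$ draw is built as the sum of $k$ squared unit-variance normals, one of which carries the mean $\sqrt{\text{non-centrality}}$.

```python
import math
import random


def noncentral_chisq(k, nc, rng):
    """One draw from a chi-square with k degrees of freedom and non-centrality nc,
    built as (Z_1 + sqrt(nc))^2 + Z_2^2 + ... + Z_k^2 with Z_j iid N(0, 1)."""
    total = rng.gauss(math.sqrt(nc), 1.0) ** 2
    for _ in range(k - 1):
        total += rng.gauss(0.0, 1.0) ** 2
    return total


def scaled_draws(n, k=3, a0=2.0, reps=2000, seed=1):
    """Simulate X_n / n, where X_n | A_n ~ chi^2_k(n * A_n) and A_n -> a0 in probability."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # Hypothetical choice of A_n: a0 plus O(n^{-1/2}) noise, truncated at zero.
        a_n = max(a0 + rng.gauss(0.0, 1.0) / math.sqrt(n), 0.0)
        draws.append(noncentral_chisq(k, n * a_n, rng) / n)
    return draws


def max_abs_dev(draws, target):
    """Largest absolute deviation of the draws from the target value."""
    return max(abs(x - target) for x in draws)


devs = {n: max_abs_dev(scaled_draws(n), 2.0) for n in (10, 1000, 100000)}
for n, d in devs.items():
    print(f"n = {n:6d}   max |X_n/n - a0| over 2000 draws = {d:.4f}")
```

The largest deviation of $X_n / n$ from $a_0$ shrinks as $n$ grows, which is the concentration the lemma describes.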


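In order to show the asymptotic performance of the Bayes factor $\mathrm{BF}_{\mathcal{M}_T:\mathcal{M}}$, we first study
asymptotic behaviors of the terms in the Bayes factors in the following lemmas. When testing
nested models, the log likelihood ratio between $\mathcal{M}_T$ and $\mathcal{M}$ converges in distribution to a
central (non-central) $\chi^2$ distribution, when the smaller (larger) model is true. The following
lemma studies asymptotic behaviors of the likelihood ratio, which does not require models
$\mathcal{M}$ and $\mathcal{M}_T$ to be nested.


Lemma 4. Denote the likelihood ratio by

$$\Lambda_{\mathcal{M}_T:\mathcal{M}} \triangleq \frac{p(Y \mid \hat{\alpha}_{\mathcal{M}_T}, \hat{\beta}_{\mathcal{M}_T}, \mathcal{M}_T)}{p(Y \mid \hat{\alpha}_{\mathcal{M}}, \hat{\beta}_{\mathcal{M}}, \mathcal{M})}. \tag{A.10}$$

As the sample size $n$ increases,

1) if $\mathcal{M}_T \subset \mathcal{M}$, then $\Lambda_{\mathcal{M}_T:\mathcal{M}} = O_p(1)$.

2) if $\mathcal{M}_T \not\subset \mathcal{M}$, then $\Lambda_{\mathcal{M}_T:\mathcal{M}} = O_p(e^{c_{\mathcal{M}} n})$, where $c_{\mathcal{M}}$ is a positive constant.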

Proof. In the first case where $\mathcal{M} \supset \mathcal{M}_T$, from the well-known results of likelihood ratio test,
$z_{\mathcal{M}} - z_{\mathcal{M}_T}$ has a central chi-square distribution $\chi^2_{p_{\mathcal{M}} - p_{\mathcal{M}_T}}$. Therefore, the limiting distribution
of the log-likelihood ratio does not depend on $n$, i.e., $\Lambda_{\mathcal{M}_T:\mathcal{M}} = O_p(1)$.

In the second case where $\mathcal{M} \not\supset \mathcal{M}_T$, we first examine the sub-case where $\mathcal{M} \subset \mathcal{M}_T$.
According to the power calculation results for GLM in Self et al. (1992) and Shieh (2000),


when testing nested models, if the larger model is true, then we have that zmt~^m converges 
in distribution to a non-central of degrees of freedom The non-centrality 

parameter T is approximately 


« E (»lMr - »Im) - WImt) - mM)] . 


i=l 


where 6**^ = for i = 1,... ,n. By Talyor expansion, there exist a 9i between 6*^^, j 

and 9%,., such that 6(0*^) = 6(0*^J + b'i9X,^^^) (0*^^ - 0*^) + 6"(0~ ) (0*^^ - 9l^f/2. 
This combined with the assumption b"{-) > 0 gives that lim„_,.oo tk/n converges to a positive 
constant cm- Then by Lemma {zmt ~ ^m)/^ cm, and hence = Op{e^^^). 

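In the case where $\mathcal{M}$ and $\mathcal{M}_T$ are not nested, we introduce a third model $\mathcal{M}'$ which
includes all the predictors in both $\mathcal{M}$ and $\mathcal{M}_T$. Using a similar method as in Self et al.
(1992), we can treat $\mathcal{M}'$ also as the true model (although with some redundant predictors)
when comparing with $\mathcal{M}$ and easily show that $z_{\mathcal{M}'} - z_{\mathcal{M}}$ also has a non-central $\chi^2$ distribution.
Hence we decompose $\Lambda_{\mathcal{M}_T:\mathcal{M}} = \Lambda_{\mathcal{M}_T:\mathcal{M}'} \cdot \Lambda_{\mathcal{M}':\mathcal{M}}$. Since both pairs $(\mathcal{M}_T : \mathcal{M}')$ and $(\mathcal{M}' : \mathcal{M})$
are nested models, we can apply the previous results twice: $\Lambda_{\mathcal{M}_T:\mathcal{M}'} = O_p(1)$ and $\Lambda_{\mathcal{M}':\mathcal{M}} = O_p(e^{c_{\mathcal{M}} n})$.
Therefore, we can conclude that $\Lambda_{\mathcal{M}_T:\mathcal{M}} = O_p(1) \cdot O_p(e^{c_{\mathcal{M}} n}) = O_p(e^{c_{\mathcal{M}} n})$. □

The Bayes factors contain the Wald statistics $Q_{\mathcal{M}_T}$ and $Q_{\mathcal{M}}$. We next study their asymptotic
behaviors.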


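Lemma 5. The Wald statistic $Q_{\mathcal{M}} = O_p(n^{\xi_{\mathcal{M}}})$, where $0 \leq \xi_{\mathcal{M}} \leq 1$. In particular,

1) If $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, then for any $\mathcal{M} \supset \mathcal{M}_T$, $\xi_{\mathcal{M}} = 1$.

2) If $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, then for any model $\mathcal{M}$, $\xi_{\mathcal{M}} = 0$.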
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Proof. For any Ai D A^r, we have shown in the proof of Lemma that the MLE 
converges in probability to the trne valne /3^, and J7n(/3x)/'^ is a hnite positive dehnite 
matrix and converges to X„(/3^)/n in probability. By Lemma and Slntsky’s theorem, we 


can rewrite the asymptotic normality (A.l) as 


Jn(/3. 


M) 


Pm - P. 


M 


N(0,I 


■PM^ 


Therefore, Qm = PmP^PPm)PM converges in distribntion to a non-central random vari¬ 
able with degrees of freedom pj^ and non-centrality parameter which is 0{n) 

if Pm ^ 0 , and zero otherwise. Since P*j^ = PXi.^ in the sense that all entries in f3*f^ that 
correspond to predictors not in are hlled with zero, P*j^ = 0 is eqnivalent to Air = Ai 0 . 
Therefore, by Lemmaif A4t 7 ^ A 40 , then Q_m = Op(n); if A4t = A 40 , then = Op(l). 

For any At p A4 t, since convergence in probability is preserved nnder addition and 
mnltiplication (Resnick 1999, pp. 175), we have Qm — PM>Jn{P m)P*m —^ f*’ i-®-’ Qm is 
most on the same order of JPPm)- Lemmawe have ^m = 7 m if P*m ^m = 0 


if P*M = 0- 


□ 


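Based on the results of Lemma 5, the next lemma discusses the asymptotic properties of
$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}}$, a term that appears in the Bayes factor under the CH prior.

Lemma 6. Under the CH prior, denote the term in $\mathrm{BF}_{\mathcal{M}_T:\mathcal{M}}$:

$$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} \triangleq \frac{B\left(\frac{a + p_{\mathcal{M}_T}}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a + p_{\mathcal{M}_T}}{2}, \frac{a + b + p_{\mathcal{M}_T}}{2}, -\frac{s + Q_{\mathcal{M}_T}}{2}\right)}{B\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{a + b + p_{\mathcal{M}}}{2}, -\frac{s + Q_{\mathcal{M}}}{2}\right)}. \tag{A.11}$$

1) If $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, then as $n$ increases,

$$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = \begin{cases} O_p\left(n^{[\xi_{\mathcal{M}} p_{\mathcal{M}} - p_{\mathcal{M}_T} - a(1 - \xi_{\mathcal{M}})]/2}\right) & \text{if } b \text{ is fixed, and } s \text{ is fixed} \\ O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right) & \text{if } b = O(n), \text{ or } s = O(n). \end{cases}$$

In particular, if $\mathcal{M} \supset \mathcal{M}_T$, then $\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$ for all $b$ and $s$.

2) If $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, then as $n$ increases,

$$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = \begin{cases} O_p(1) & \text{if } b \text{ is fixed, and } s \text{ is fixed} \\ O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right) & \text{if } b = O(n), \text{ or } s = O(n). \end{cases}$$

Proof. We first show Case 1) where $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$; by Lemma 5, $\xi_{\mathcal{M}_T} = 1$. We consider the
following three scenarios about parameters $b$ and $s$ being fixed or $O(n)$.

Scenario 1: Both $b, s$ are fixed. By Abramowitz and Stegun (1970) formula (13.1.5),

$${}_1F_1(a, b, s) = \frac{\Gamma(b)}{\Gamma(b - a)} (-s)^{-a} \left[1 + O(|s|^{-1})\right], \quad \text{when } \mathrm{Real}(s) < 0. \tag{A.12}$$

The continuous mapping theorem suggests that for any model $\mathcal{M}$ whose $Q_{\mathcal{M}} = O_p(n^{\xi_{\mathcal{M}}})$,

$$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} \approx \frac{\Gamma\left(\frac{a + p_{\mathcal{M}_T}}{2}\right) \left(\frac{s + Q_{\mathcal{M}_T}}{2}\right)^{-\frac{a + p_{\mathcal{M}_T}}{2}}}{\Gamma\left(\frac{a + p_{\mathcal{M}}}{2}\right) \left(\frac{s + Q_{\mathcal{M}}}{2}\right)^{-\frac{a + p_{\mathcal{M}}}{2}}} \propto \frac{(s + Q_{\mathcal{M}_T})^{-\frac{a + p_{\mathcal{M}_T}}{2}}}{(s + Q_{\mathcal{M}})^{-\frac{a + p_{\mathcal{M}}}{2}}} = O_p\left(n^{[\xi_{\mathcal{M}} p_{\mathcal{M}} - p_{\mathcal{M}_T} - a(1 - \xi_{\mathcal{M}})]/2}\right). \tag{A.13}$$

Scenario 2: $b$ is fixed, and $s = O(n)$. Since $s + Q_{\mathcal{M}_T} = O(n)$ and $s + Q_{\mathcal{M}} = O(n)$, then
by (A.13), $\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.

Scenario 3: $b = O(n)$. Lemma 5 indicates that $Q_{\mathcal{M}}$ is between $O_p(1)$ and $O_p(n)$. By
Slater (1960) formula (4.3.3): if $b$ is large, and $a, s$ are bounded, then

$${}_1F_1(a, b, s) = 1 + O(|b|^{-1}) \text{ is bounded;} \tag{A.14}$$

and by Slater (1960) formula (4.3.7): if $b$ is large, $s = by$, and $a, y$ are bounded, then

$${}_1F_1(a, b, s) = (1 - y)^{-a} \left[1 - \frac{a(a + 1)}{2b} \left(\frac{y}{1 - y}\right)^2 + O(|b|^{-2})\right] \text{ is also bounded.} \tag{A.15}$$

Therefore, under the CH prior when parameter $b = O(n)$,

$$\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = \frac{B\left(\frac{a + p_{\mathcal{M}_T}}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a + p_{\mathcal{M}_T}}{2}, \frac{a + b + p_{\mathcal{M}_T}}{2}, -\frac{s + Q_{\mathcal{M}_T}}{2}\right)}{B\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{b}{2}\right) {}_1F_1\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{a + b + p_{\mathcal{M}}}{2}, -\frac{s + Q_{\mathcal{M}}}{2}\right)}.$$

According to Stirling's Formula $\Gamma(n) = e^{-n} n^{n - \frac{1}{2}} (2\pi)^{\frac{1}{2}} (1 + O(n^{-1}))$, the above ratio becomes
$O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.

Next we examine Case 2) where $\mathcal{M}_T = \mathcal{M}_{\varnothing}$. In this case, Lemma 5 suggests that both
$Q_{\mathcal{M}_T}$ and $Q_{\mathcal{M}}$ are on the same order $O_p(1)$. Hence in Scenario 1, where both $b$ and $s$ are
fixed, $\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}} = O_p(1)$. In Scenario 2, since both $s + Q_{\mathcal{M}_T}$ and $s + Q_{\mathcal{M}}$ are on the order of
$O_p(n)$, the same derivation and result as in Case 1) Scenario 2 apply. In Scenario 3, both
$s + Q_{\mathcal{M}_T}$ and $s + Q_{\mathcal{M}}$ are $O_p(1)$ if $s$ is fixed, and $O_p(n)$ if $s = O(n)$, so the same derivation
and result as in Case 1) Scenario 3 apply.

□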
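
The Stirling-formula step in Scenario 3 can be illustrated numerically. The Python sketch below is a hedged example under stated assumptions: it takes $b = n$ exactly, ignores the bounded ${}_1F_1$ factors, and uses hypothetical values $a = 1$, $p_{\mathcal{M}_T} = 2$, $p_{\mathcal{M}} = 5$ and hypothetical function names. It estimates the growth rate of the Beta-function ratio $B\left(\frac{a + p_{\mathcal{M}_T}}{2}, \frac{b}{2}\right) / B\left(\frac{a + p_{\mathcal{M}}}{2}, \frac{b}{2}\right)$ in $n$ on the log scale.

```python
import math


def log_beta(p, q):
    """log B(p, q) through log-Gamma, stable for large arguments."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def log_beta_ratio(n, a=1.0, p_true=2, p_big=5):
    """log of B((a + p_true)/2, n/2) / B((a + p_big)/2, n/2), taking b = n."""
    return log_beta((a + p_true) / 2, n / 2) - log_beta((a + p_big) / 2, n / 2)


def empirical_rate(n_small, n_large, **kwargs):
    """Slope of the log ratio against log n between two sample sizes."""
    rise = log_beta_ratio(n_large, **kwargs) - log_beta_ratio(n_small, **kwargs)
    return rise / (math.log(n_large) - math.log(n_small))


rate = empirical_rate(1e4, 1e6)
print(f"empirical exponent: {rate:.4f}   (p_big - p_true) / 2 = {(5 - 2) / 2}")
```

The estimated exponent is close to $(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2$, matching the order $O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$ stated in the lemma.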


Lemma 7. Under the robust prior, denote the term in $\mathrm{BF}_{\mathcal{M}_T:\mathcal{M}}$:


'Ml 


R A f pm + ^ Y ^ _ 

\PMt + 1 ) 


+1 Qtk (p^ +1) 


7 


2 ’ 2(n+l) 


Q 


M 


ry ( VmM Qm {pmM) ^ 

' 1 2 ’ 2(n+l) ) 


(A.16) 


As the sample size $n$ increases, $\Omega^{R}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.


Proof. By Abramowitz and Stegun (1970) formula (6.5.12), the incomplete Gamma function
$\gamma(a, s) = \int_0^s t^{a-1} e^{-t}\, dt$ can be expressed using the ${}_1F_1$ function

$$\gamma(a, s) = {}_1F_1(a, a + 1, -s)\, \frac{s^a}{a}. \tag{A.17}$$

Therefore, (A.16) becomes


/ + 1 

\PMt + 1 


^ Qa4t^^__ 

P.M + 1 

q; ^ 




^ ^ j’ (PA^j’^1)'N ^ p ^ PAI^“t“l PAI Q AI'p (PA(^ “t“l)'N 

\ 2 ) 2(n+l) y 1-^1 ^ ^ , 2(n+l) ) 


( PAT+if ^ f Qa|(PAI + 1) 
V 2 2(n+l) 


PA4+1 

2 


p / PAI+1 PAI+3 _ Qm (PAI+1) ' 

2 ’ 2 ’ 2(n+l) J 


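Since ${}_1F_1(a, b, 0) = 1$, and both $Q_{\mathcal{M}_T}/n$ and $Q_{\mathcal{M}}/n$ are bounded, the ratio between the ${}_1F_1$
functions is bounded as $n$ increases. Therefore we further simplify $\Omega^{R}_{\mathcal{M}_T:\mathcal{M}} \propto (n + 1)^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.
This result holds no matter whether $\mathcal{M}_T = \mathcal{M}_{\varnothing}$ or not. □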

Lemma 8. Under the intrinsic prior, denote the term in $\mathrm{BF}_{\mathcal{M}_T:\mathcal{M}}$:


n 


I 


A 


( n+p;v4+l \ 2 2(n+pj^+l) p f PAIj + l l\ T 1 1 P2My+2 QA<r(PAIr+l) _ PAIy+l 

V PAI+1 J ^ \ 2 ^ 2 J \2^ 2 ’ 2 (n+pMr+l) ’ n 

PMj. (pAly+1) 

/ ^+PA!j+l \ 2 2^n+p^^+l) ry ( Pai+1 1'\ ^ 1 PA(+2 Qai(PA 1 + 1) PM+i^ 

V PAIy+l J t 2 ’ 2t 1^2’ 2 )2(n+p^+l)’ n ^ 


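As the sample size $n$ increases, $\Omega^{I}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.

Proof. Since $p_{\mathcal{M}_T}, p_{\mathcal{M}}$ are bounded, and $Q_{\mathcal{M}_T}/n$, $Q_{\mathcal{M}}/n$ are bounded in probability, as $n \to \infty$,

$$\Omega^{I}_{\mathcal{M}_T:\mathcal{M}} \approx \frac{\left(\frac{n + p_{\mathcal{M}} + 1}{p_{\mathcal{M}} + 1}\right)^{p_{\mathcal{M}}/2}}{\left(\frac{n + p_{\mathcal{M}_T} + 1}{p_{\mathcal{M}_T} + 1}\right)^{p_{\mathcal{M}_T}/2}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right).$$

□

Lemma 9. Under the local EB, denote the term in $\mathrm{BF}_{\mathcal{M}_T:\mathcal{M}}$:

$$\Omega^{LEB}_{\mathcal{M}_T:\mathcal{M}} \triangleq \frac{\max\left\{\frac{Q_{\mathcal{M}_T}}{p_{\mathcal{M}_T}}, 1\right\}^{-\frac{p_{\mathcal{M}_T}}{2}} \exp\left(-\frac{Q_{\mathcal{M}_T}}{2 \max\{Q_{\mathcal{M}_T}/p_{\mathcal{M}_T},\, 1\}}\right)}{\max\left\{\frac{Q_{\mathcal{M}}}{p_{\mathcal{M}}}, 1\right\}^{-\frac{p_{\mathcal{M}}}{2}} \exp\left(-\frac{Q_{\mathcal{M}}}{2 \max\{Q_{\mathcal{M}}/p_{\mathcal{M}},\, 1\}}\right)}. \tag{A.18}$$

1) If $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, then as $n$ increases, $\Omega^{LEB}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(\xi_{\mathcal{M}} p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$. In particular, if
$\mathcal{M} \supset \mathcal{M}_T$, then $\Omega^{LEB}_{\mathcal{M}_T:\mathcal{M}} = O_p\left(n^{(p_{\mathcal{M}} - p_{\mathcal{M}_T})/2}\right)$.

2) If $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, then as $n$ increases, $\Omega^{LEB}_{\mathcal{M}_T:\mathcal{M}} = O_p(1)$.

Proof. Case 2) is straightforward, because when $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, $Q_{\mathcal{M}_T} = O_p(1)$ and $Q_{\mathcal{M}} = O_p(1)$.

Now let us focus on Case 1). In (A.18), the numerator equals $\exp(-Q_{\mathcal{M}_T}/2)$ if and only if
$Q_{\mathcal{M}_T} \leq p_{\mathcal{M}_T}$; the denominator follows the same rule when we replace $\mathcal{M}_T$ with $\mathcal{M}$.
Since $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, $Q_{\mathcal{M}_T} = O_p(n)$ is greater than $p_{\mathcal{M}_T}$ for large $n$. Hence the numerator
of (A.18) is proportional to $(Q_{\mathcal{M}_T}/p_{\mathcal{M}_T})^{-p_{\mathcal{M}_T}/2} \exp(-p_{\mathcal{M}_T}/2) = O_p\left(n^{-p_{\mathcal{M}_T}/2}\right)$. For model $\mathcal{M}$
whose $Q_{\mathcal{M}} = O_p(n^{\xi_{\mathcal{M}}})$, if $\xi_{\mathcal{M}} > 0$, then when $n$ is large enough, $Q_{\mathcal{M}} > p_{\mathcal{M}}$, so the denominator
is $O_p\left(n^{-\xi_{\mathcal{M}} p_{\mathcal{M}}/2}\right)$. If $\xi_{\mathcal{M}} = 0$, then the denominator is $O_p(1)$, which can also be written as
$O_p\left(n^{-\xi_{\mathcal{M}} p_{\mathcal{M}}/2}\right)$.

□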
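
The piecewise behaviour used in this proof (the local EB term equals $\exp(-Q/2)$ exactly when $Q \leq p$) can be reproduced numerically. The Python sketch below is an illustration under an explicit assumption: the fixed-$g$ marginal likelihood is taken to have kernel $(1 + g)^{-p/2} \exp\{-Q/[2(1 + g)]\}$ as a function of $g$, so that the local EB estimate is $\hat{g} = \max(Q/p - 1, 0)$. The function names and the numerical inputs are hypothetical.

```python
import math


def log_kernel(g, q, p):
    """log of (1 + g)^(-p/2) * exp(-q / (2 * (1 + g))), the assumed kernel in g."""
    return -0.5 * p * math.log1p(g) - q / (2.0 * (1.0 + g))


def local_eb_g(q, p):
    """Closed-form maximiser of the kernel over g >= 0."""
    return max(q / p - 1.0, 0.0)


def grid_argmax(q, p, g_max=50.0, steps=50000):
    """Brute-force maximiser on an evenly spaced grid over [0, g_max]."""
    best_g, best_val = 0.0, log_kernel(0.0, q, p)
    for i in range(1, steps + 1):
        g = g_max * i / steps
        val = log_kernel(g, q, p)
        if val > best_val:
            best_g, best_val = g, val
    return best_g


# Case Q > p: interior maximum at g = Q/p - 1, kernel value (Q/p)^(-p/2) * exp(-p/2).
q_big, p = 12.0, 3
print(local_eb_g(q_big, p), grid_argmax(q_big, p))

# Case Q <= p: maximum on the boundary g = 0, kernel value exp(-Q/2).
q_small = 2.0
print(local_eb_g(q_small, p), grid_argmax(q_small, p))
```

In the first case the grid search agrees with $Q/p - 1$ and the maximised kernel is $(Q/p)^{-p/2} \exp(-p/2)$; in the second case both give $\hat{g} = 0$ and the kernel reduces to $\exp(-Q/2)$.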


We now examine the model selection consistency.

Proof of Theorem 1

Proof. By Lemma Jn{.c^M) = Op{n'^-^), where 0 < tm < 1, and tm = 1 if Ad D Aip- 
Hence, 


ff71 (n. ) 
fJrfpLM) 


= Op n 2 


.IzIAi 


For the CH prior. 


BF 




\Jn i&A4T ) 

ffni&Ad') 


■ ^^t-.m ■ [1 + Op{l/n)]. 


(A.19) 


We first consider the case where both b and s are fixed, by using the results in Lemma and 
§ In the case where Aip ^ Ad^, for any non-true model Ad D Adj-, then pm > Pmt^ '^m = 1; 
and (m = hence 


BF_a 4 j,:A 4 — OpQ) ■ Op{l) ■ Op ( n 2 j . [1 -|- Op{l/n)] — > 00 . 
On the other hand, if $\mathcal{M} \not\supset \mathcal{M}_T$, then


BF 




= Op (n 


‘PPLM 


■ Op ■ Op 


iMPM-PMrp 

n 2 ^ 




■ [1 + Op(l/n)] —)■ cx). 


In contrast, if Aip = then for any model Ai, since Ai D Aip, = 1- So the Bayes 
factor 

= Op (1) ■ Op (1) ■ Op (1) • [1 + Op(l/n)] 

is bounded, which suggests the selection consistency does not hold when $\mathcal{M}_T = \mathcal{M}_{\varnothing}$.

Next consider the case where b = 0{n) or s = 0{n). For any model Ai Aip, the proof 
is similar as above. If D Air, then tm = 1 and pm > PMt^ 

BF_Mt:A 4 = Op(l) ■ Op{l) ■ Op in 2 j . [X _|_ Op{l/n)] —)• oo. 


which holds even when Aip = Ai 0 . 

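For the robust prior, the intrinsic prior, and local EB, their Bayes factors are given by
(A.19), with $\Omega^{CH}_{\mathcal{M}_T:\mathcal{M}}$ replaced by $\Omega^{R}_{\mathcal{M}_T:\mathcal{M}}$, $\Omega^{I}_{\mathcal{M}_T:\mathcal{M}}$, and $\Omega^{LEB}_{\mathcal{M}_T:\mathcal{M}}$, respectively. The proofs rely on
Lemmas 7, 8, and 9 and are similar to that for the CH prior, hence omitted. □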



A.12 Proof of Proposition


Proof. If $b = O(n)$, then by (A.14) or (A.15),


E(l/^) = 


oc 


B(f+ 1,1-1) + 

rj/a b\ IP f a a+b 

Al [ 2 ^—^- 2 ) 

^ (I + I “ 1 ) , ^ 


(A.20) 


Bill) 


b -2 


= 0 {l/n). 
If $b$ is fixed and $s = O(n)$, then by (A.12) and (A.20),




□ 


A.13 Proof of Proposition


Proof. For the CH prior, according to (30), the conditional posterior of $z = 1 - u$ is


2 I Y.Af ~ CH I 

' 2 2 2 


(A. 21 ) 


and its characteristic fnnction is 




2:2 


-1 


(1 — 2 


(^+.<)z 




P ( b <^+b+PMj. ^+Qmj 


-dz = 


T-i / b ^~^b-\-pj^^ 

l-rU2> - 




TP r b <^+b+PMj. s+Qai^ \ 
- 2 -’ - 2 - ) 


Lemma 5 shows that if $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, then $s + Q_{\mathcal{M}_T} = O_p(n)$. If $b = O(1)$, then by (A.8) and
the continuous mapping theorem, for any $t \in \mathbb{R}$, as $n$ goes to infinity,




exp( 


5 +QAI 7 


+ it)-{ 




+ ity 


'^+PMr, 


exp( 


s+Qmh 


■)•( 


^+QAdT" \ — 


<^+PMn 


exp (ft). 


If $b = O(n)$, then using formula (A.15), we can obtain the same limit.
For the robust prior, we examine the characteristic function of $u = 1 - z$. Based on (35),




M 2 ^ ^ du 




PA1y+l 

n+1 PMrp+^ ^ QMj,- 


u 


e du 


'0 


,, /PA1t+ 1 (QA1r-2M)(p^j, + l)^ ^ ^ PAIj+l 

">'1 2 > 2(n+l) ) f Qmt ~ 2*^^ ^ 


7 


2 ’ 2(n+l) 


Qa42 


Since $Q_{\mathcal{M}_T} = O_p(n)$, for any fixed $t \in \mathbb{R}$, the ratio of the incomplete Gamma functions goes
to 1, and so does the second fraction. Therefore, $\phi_u(t) \xrightarrow{p} 1$, which is the characteristic
function of the degenerate distribution at 0.

For the intrinsic prior, by (28) and Table the conditional posterior of $u$ is


I AA A/f i-nrni I P-^T ^ ^ 1 1 Qmt ^ + PXr + l n + pxj, + l 

MI, J\Ax ~tGGff( , ,1, , , 

2 2 2 pmt + 1 n 


(A.22) 


and hence its characteristic function for any $t \in \mathbb{R}$ is


(j)u{t) = exp 


^ / 1 -I 

it{pMT + 1) 1 V2’ 2 ’ 


n + PMt + 1 


1 1 PAAy+2 (QaIj.— 2it)(p7K +1) PA1y+l 


2 (n+p^ + 1 ) ’ 


T / 1 1 PAAj’^2 QaA (PAA'p^1) PAAt^^I 

I 2’ 


2 ’ 2 (n+p^ + 1 ) ’ 


(A.23) 


Since Qmt — Opiji) and 


{Qmt ~ ‘^^P){PMt “I" ^) _ Qmt{PMt + 1) 
2 (m + Pmt + 1 ) 2 (n + Pmt + 1 ) 


0 , 


by the continuous mapping theorem, the ratio of the two $\Phi_1$ functions in (A.23) converges to one
in probability. Therefore, under the intrinsic prior, $\phi_u(t) \xrightarrow{p} 1$. □

A.14 Proof of Theorem 2


Proof. For the CH prior, we will prove the BMA estimation consistency in two steps: 1) M.t ^ 
M .0 and 2) When M-j. ^ the model selection consistency always holds, so 

we just need to show the estimation consistency under the true model M.t- For notation 
simplicity, we denote According to (15) and (A. 21 ), the characteristic 

function of the posterior distribution \ is 


p{f3j^JMT,Y) d(3^^ 

= j p{f3^^\z,MT,Y) p{z\MT,Y)dz^di3^^ 

= J p{(3^^\z,Mt,Y) p{z\MT,Y)dz 

= J p(z\MT,Y)dz 


In the above calculation, the integrand e** has a bounded modulus, so according to 
Fubini’s Theorem, the two integrals (with respect to z and f3j^^) can be interchanged. Since 
Qmt — Op{n) and TinMr — Op{n~^), using methods similar to the proof of Proposition]^ 
and asymptotic normality of MLE, we can show that for any vector t. 


(*) 






On the other hand, when $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, under the CH prior model selection consistency does
not hold if both $b$ and $s$ are fixed. Hence we need to examine the limit of the posterior distribution
of $\beta_{\mathcal{M}}$ under all models. Under any model $\mathcal{M}$, the true model is nested in it, so the MLE
of the coefficient $\beta_{\mathcal{M}}$ converges to the true parameters 0 in probability as $n$ goes to infinity.
Since the modulus of is bounded by a constant 1, which is integrable if regarded as a 

function of $z$, so according to the dominated convergence theorem, the characteristic function
of the posterior distribution $p(\beta_{\mathcal{M}} \mid Y, \mathcal{M})$ evaluated at any vector $t \in \mathbb{R}^{p_{\mathcal{M}}}$ is

I j^^Y)dz 

,.pt^o-it^ot)l I Yi^Y)dz = 1. 


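For the robust and intrinsic priors, model selection consistency always holds. So we just
need to consider $\beta_{\mathcal{M}_T}$ under $\mathcal{M}_T$. Based on (35) and (A.22), proofs similar to the above proof of
the CH prior can show that when either $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$ or $\mathcal{M}_T = \mathcal{M}_{\varnothing}$, the characteristic function of
$p(\beta_{\mathcal{M}_T} \mid Y)$ converges to $e^{i t^T \beta^*_{\mathcal{M}_T}}$ or 1 in probability, respectively. □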

B Test-Based Bayes Factors 

B.1 Test-Based Bayes Factor under the $g$-Prior

In Bayesian hypothesis testing, while the traditional Bayes factor computes the ratio between
marginal likelihoods of data (referred to as data-based BF, or DBF in short), another type
of Bayes factor, defined as the ratio between marginal likelihoods of a test statistic, has also
been introduced (Johnson 2005, 2008). In particular, based on the likelihood ratio statistic,
the test-based Bayes factor (TBF) has been applied in model selection under the $g$-prior (Hu
and Johnson 2009; Held et al. 2015, 2016), where models with high TBFs are preferable.


To compute the TBF based on the likelihood ratio deviance $z_{\mathcal{M}}$ (20), first, asymptotic
theory (Davidson and Lever 1970) suggests that the limit distribution of $z_{\mathcal{M}}$ under the null
model $\mathcal{M}_{\varnothing}$ and under a local alternative model $\mathcal{M}$ are central and non-central Chi-squares:


Zm \ M 0 I AJ ~ where Xm = 


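Then, as $p(z_{\mathcal{M}} \mid \mathcal{M}, \beta_{\mathcal{M}})$ depends on $\beta_{\mathcal{M}}$ through the non-centrality parameter $\lambda_{\mathcal{M}}$, integrating
$\beta_{\mathcal{M}}$ out under its prior density yields the marginal likelihood $p(z_{\mathcal{M}} \mid \mathcal{M})$. Last, the TBF
is defined as the ratio

$$\mathrm{TBF}_{\mathcal{M}:\mathcal{M}_{\varnothing}} = \frac{p(z_{\mathcal{M}} \mid \mathcal{M})}{p(z_{\mathcal{M}} \mid \mathcal{M}_{\varnothing})} = \frac{\int p(z_{\mathcal{M}} \mid \beta_{\mathcal{M}}, \mathcal{M})\, p(\beta_{\mathcal{M}} \mid \mathcal{M})\, d\beta_{\mathcal{M}}}{p(z_{\mathcal{M}} \mid \mathcal{M}_{\varnothing})}. \tag{B.1}$$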


To conduct model selection in GLMs, Held et al. (2015) derive the TBF under the $g$-prior,
in whose density $\beta_{\mathcal{M}}$ appears in the format of $\lambda_{\mathcal{M}}$. Thus the conjugacy permits a
tractable marginal likelihood $p(z_{\mathcal{M}} \mid \mathcal{M})$ as a Gamma distribution. Therefore, the resulting
TBF has a closed form expression as in (21).


B.2 Comparing Data-Based and Test-Based Bayes Factors 


The TBF (21) has a similar expression to the DBF (19). In fact, the two Bayes factors would 


be the same if z_m = Qm ^md = JniptM)- Naturally, it is interesting to examine 

how different the two Bayes factors are. 


We compare DBF (19) and TBF (21) empirically through a logistic regression toy example,
with $g = n$ and a single covariate generated from independent standard normal distributions.
With the intercept set to $\alpha = 0.5$, three scenarios are studied with different coefficients
$\beta = 0, 20/\sqrt{n}, 2$, which correspond to the null, local alternative, and alternative, respectively.
To study asymptotics, various sample sizes $n = 100, 500, 1000, 5000$ are taken. For
each combination of $\beta$ and $n$, 100 independent datasets are generated. To obtain an accurate
approximation to the DBF, in addition to the integrated Laplace approximation (ILA)
formula (19), we also implement importance sampling (IS), which can be viewed as a gold
standard if the number of samples drawn is large. Here we draw $m = 10000$ samples
independently from Student-t distributions with degrees of freedom 4, with location and scale
parameters matching those in the corresponding conditional posteriors (15), (16).


Figure 3 shows that when the null or the local alternative is true, TBF (21) is asymptotically
the same as the DBF computed under either IS or ILA (19). In contrast, when
the alternative is true, TBF differs from DBF by a relatively small but systematic amount.

Figure 3: From top to bottom: TBF versus DBF approximated by IS, DBF approximated by
ILA versus DBF approximated by IS, and TBF versus DBF approximated by ILA. From left to
right: the null, local alternative, and alternative hypotheses.

Figure 4: Wald statistic $Q_{\mathcal{M}}$ versus the deviance $z_{\mathcal{M}}$.

Comparison between the Wald statistic (18) and the deviance (20) suggests a similar
phenomenon (Figure 4). They are asymptotically the same under the null or local alternative,
but different under the alternative.

In addition to the similarity between the two Bayes factors under $g$-priors, we notice that
as a function of $g$, the test-based marginal likelihood would have the same kernel
$p(z_{\mathcal{M}} \mid g, \mathcal{M}) \propto (1 + g)^{-p_{\mathcal{M}}/2} \exp\left(-z_{\mathcal{M}} / [2(1 + g)]\right)$ as its data-based counterpart (17) if $z_{\mathcal{M}} = Q_{\mathcal{M}}$.


Therefore, all empirical Bayes and fully Bayes approaches on g, discussed in Section |2.6| and 
Section can be readily applied to test-based methods with minimal changes. Held et ah 


(2015) apply local empirical Bayes, $p(z_{\mathcal{M}} \mid \mathcal{M}) = \max_{g \geq 0} p(z_{\mathcal{M}} \mid g, \mathcal{M})$, and fully Bayes,
$p(z_{\mathcal{M}} \mid \mathcal{M}) = \int p(z_{\mathcal{M}} \mid g, \mathcal{M})\, p(g)\, dg$, to compute marginal likelihoods for TBFs. However, we
find that these optimized and integrated versions of TBF may no longer be coherent, in the
sense that results change with the choice of the baseline model. Elaborating, when testing
nested models $\mathcal{M}_1 \subset \mathcal{M}_2$,

$$\mathrm{TBF}_{\mathcal{M}_2:\mathcal{M}_1} \neq \frac{\mathrm{TBF}_{\mathcal{M}_2:\mathcal{M}_{\varnothing}}}{\mathrm{TBF}_{\mathcal{M}_1:\mathcal{M}_{\varnothing}}}$$

if one computes the left hand side TBF under baseline $\mathcal{M}_1$, but computes the right hand side
TBFs under baseline $\mathcal{M}_{\varnothing}$. The main reason for this incoherence is that for model $\mathcal{M}$, unlike
the data-based marginal likelihood which only depends on $\mathcal{M}$ itself, the test statistic $z_{\mathcal{M}}$ also
depends on the baseline model. On the other hand, coherence exists for the TBF under fixed
$g$ (21), since $z_{\mathcal{M}_2:\mathcal{M}_1} = z_{\mathcal{M}_2:\mathcal{M}_{\varnothing}} - z_{\mathcal{M}_1:\mathcal{M}_{\varnothing}}$. Hence, change of baseline models does not affect
the results of the TBF under fixed $g$, which is also the case with the DBF.
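
The coherence argument can be made concrete with a small numerical sketch. The Python code below is an illustration under explicit assumptions: the fixed-$g$ TBF is taken to be $(1 + g)^{-p/2} \exp\{g z / [2(1 + g)]\}$, the local EB version plugs in $\hat{g} = \max(z/p - 1, 0)$ separately for each comparison, and the deviances and dimensions are made-up numbers. The function names are hypothetical.

```python
import math


def log_tbf_fixed_g(z, p, g):
    """log TBF against the baseline for deviance z, dimension difference p, fixed g
    (assumed form: (1 + g)^(-p/2) * exp(g * z / (2 * (1 + g))))."""
    return -0.5 * p * math.log1p(g) + g * z / (2.0 * (1.0 + g))


def log_tbf_local_eb(z, p):
    """log TBF with g replaced by its local empirical Bayes estimate for this comparison."""
    g_hat = max(z / p - 1.0, 0.0)
    return log_tbf_fixed_g(z, p, g_hat)


# Made-up deviances against the null model for nested models M1 within M2.
z1, p1 = 4.0, 2
z2, p2 = 20.0, 5
g = 100.0

# Direct comparison M2 : M1 uses the deviance difference z2 - z1 on p2 - p1 degrees of freedom.
fixed_direct = log_tbf_fixed_g(z2 - z1, p2 - p1, g)
fixed_via_null = log_tbf_fixed_g(z2, p2, g) - log_tbf_fixed_g(z1, p1, g)

eb_direct = log_tbf_local_eb(z2 - z1, p2 - p1)
eb_via_null = log_tbf_local_eb(z2, p2) - log_tbf_local_eb(z1, p1)

print(f"fixed g : direct = {fixed_direct:.4f}, via null = {fixed_via_null:.4f}")
print(f"local EB: direct = {eb_direct:.4f}, via null = {eb_via_null:.4f}")
```

With $g$ fixed the two routes give the same log TBF, because the expression is linear in $z$ and $p$; with local EB each comparison selects its own $\hat{g}$, so the direct value and the value obtained through the null baseline differ.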


C Additional Simulation Examples 


We first include some additional results from the logistic regression simulation example that
is examined in Section 5.1 (see Table C.1) and then introduce a different simulation study
on Poisson regressions.


Table C.1: Logistic regression simulation example: average size of selected models, out of 100
realizations.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_MT                         0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75

CH(a = 1/2, b = n)           0       0       5       4      10       8      17      13      17      15       5       3
CH(a = 1, b = n)             0       0       5       5      10       8      17      13      18      15       5       3
CH(a = 1/2, b = n/2)         0       0       6       5      10       9      17      14      25      20       5       3
CH(a = 1, b = n/2)           0       0       6       5      10       9      17      14      26      22       5       3
Beta-prime                   0       0       5       4      10       8      17      13      19      15       5       3
ZS adapted                   0       0       5       5      10       8      17      13      18      15       5       3
Benchmark                    0       0       6       6      11      10      18      15      27      24       5       3
Robust                       0       0       6       5      11       9      18      14      34      30      21      10
Intrinsic                    0       0       6       5      11       9      18      14      32      30      14       5
Hyper-g/n                    0       1       6       5      11      10      18      15      69      56      99      80
DBF, g = n                   0       0       5       4       9       7      15      11       7       5       5       3
TBF, g = n                   0       0       5       4       9       7      15      11       7       5       5       3
Jeffreys                     3       3       6       6      11      10      18      15      70      60      99      91
Hyper-g                      4       4       6       6      11      10      18      15      70      61     100      93
Uniform                      4       4       7       6      12      10      18      15      70      61     100      97
Local EB                    19      19       6       6      11      10      18      15      71      60     100      96
AIC                          3       3       8       7      12      11      18      15      34      34       6       4
BIC                          0       0       5       4       9       7      15      11       7       5       5       3



The simulation setup of the Poisson regression example is similar to that of the logistic
regression in Section 5.1. True values of coefficients (including the intercept) are set to
one-fifth of those in the logistic regression, to avoid occasional extremely large values in Y.

Tables C.2 to C.4 display model selection and parameter estimation performance. Comparison
among priors on $g$ leads to similar conclusions to the logistic regression example. For the
Poisson regression, overall model selection accuracy is not as high as the logistic regression
when $\mathcal{M}_T \neq \mathcal{M}_{\varnothing}$, which is likely due to the smaller magnitude of coefficients.


Table C.2: Poisson regression simulation example: number of times the true model is selected
out of 100 realizations. Column-wise maximum is in bold type.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_MT                         0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75

CH(a = 1/2, b = n)          94      92      10       2      10       0       0       0       2       0       1       0
CH(a = 1, b = n)            87      89      10       2      10       0       0       0      11       1       1       0
CH(a = 1/2, b = n/2)        91      89      11       2      10       0       0       0       3       0       1       0
CH(a = 1, b = n/2)          82      85      11       2       9       0       0       0       5       2       2       0
Beta-prime                  94      92      10       2      10       0       0       0       7       0       1       0
ZS adapted                  87      89      10       2      11       0       0       0       6       0       1       0
Benchmark                   97      93       7       0      12       1       0       0       4       0       1       0
Robust                      91      89       9       2      11       0       0       0       1       0       3       0
Intrinsic                   85      88       8       2      12       1       0       0       1       0       3       0
Hyper-g/n                   84      87       9       0      12       1       0       0       1       0       3       0
DBF, g = n                  84      88       7       0       8       0       0       0      11       0       1       0
TBF, g = n                  84      88       7       0       8       0       0       0      14       0       1       0
Jeffreys                     0       0       7       1      12       1       0       0       0       0       3       0
Hyper-g                      6       7       7       0      13       1       0       0       0       0       3       0
Uniform                      4       2       7       0      13       1       0       0       1       1       3       0
Local EB                     0       0       7       0      13       1       0       0       0       0       3       0
AIC                          4       4       3       0       6       1       1       0       0       0       8       0
BIC                         84      88       7       0       8       0       0       0      13       1       1       0

Table C.3: Poisson regression simulation example: average size of selected models, out of 100
realizations.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_{M_T}                      0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75
CH(a = 1/2, b = n)           0       0       4       3       9       5      13       7      12       7       3       2
CH(a = 1, b = n)             0       0       4       3       9       5      13       7      13       8       3       2
CH(a = 1/2, b = n/2)         0       0       5       3       9       6      13       8      16      10       3       2
CH(a = 1, b = n/2)           0       0       5       3       9       6      13       8      17      10       3       2
Beta-prime                   0       0       4       3       9       5      13       7      13       7       3       2
ZS adapted                   0       0       4       3       9       5      13       7      13       7       3       2
Benchmark                    0       0       5       4      10       7      14       9      17       7       3       1
Robust                       0       0       5       3       9       6      14       8      20      14       3       2
Intrinsic                    0       0       5       3      10       6      14       8      22      13       3       2
Hyper-g/n                    0       0       5       4       9       7      14       9      24      31       3       4
DBF, g = n                   0       0       4       2       8       5      12       6       5       3       3       2
TBF, g = n                   0       0       4       2       8       5      12       6       6       4       3       2
Jeffreys                     2       3       5       4      10       7      14      10      29      36       3      18
Hyper-g                      3       4       5       5      10       7      15      10      30      37       3      24
Uniform                      4       4       6       5      10       7      15      10      30      38       3      34
Local EB                    19      19       5       5      10       7      15      10      32      74       3      76
AIC                          3       3       7       6      11       8      16      11      30      28       4       2
BIC                          0       0       4       2       8       5      12       6       5       3       3       2

Table C.4: Poisson regression simulation example: 1000 times the average SSE out of 100
realizations. Column-wise minimum is in bold type.

p                           20      20      20      20      20      20      20      20     100     100     100     100
p(M)                   Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform Uniform BB(1,1) BB(1,1)
p_{M_T}                      0       0       5       5      10      10      20      20       5       5       5       5
r                            0    0.75       0    0.75       0    0.75       0    0.75       0    0.75       0    0.75
CH(a = 1/2, b = n)           5       8      24      61      34     120      58     198      66     132      37     103
CH(a = 1, b = n)             6       9      24      61      34     120      58     197      66     134      37      98
CH(a = 1/2, b = n/2)         7      11      24      61      33     116      56     188      75     148      36      97
CH(a = 1, b = n/2)           7      13      24      61      33     115      55     187      77     135      36      94
Beta-prime                   5       8      24      61      34     120      58     197      66     132      37     103
ZS adapted                   6       9      24      61      34     119      55     197      66     125      37      99
Benchmark                    8      18      26      65      33     108      51     170      74     150      36     133
Robust                       7      13      25      63      33     115      51     183      88     182      36      97
Intrinsic                    8      14      25      63      33     115      51     182      90     183      35      94
Hyper-g/n                    5      12      25      65      33     109      52     172      84     162      36      97
DBF, g = n                   5       6      25      63      37     132      68     231      40      83      39     101
TBF, g = n                   5       6      25      63      37     132      68     231      40      84      39     101
Jeffreys                     4       9      26      67      33     108      51     169      87     165      35      97
Hyper-g                      4       7      26      68      33     108      51     168      87     164      34     112
Uniform                      3       7      26      70      33     108      51     168      87     164      34     121
Local EB                     2       4      26      71      33     108      51     168      99     256      34     222
AIC                         17      40      28      74      34     115      46     171     120     284      37      79
BIC                          5       6      25      63      37     132      68     231      40      84      39     100